Preface


The original motivation for writing this book was rather personal. The first 
author, in the course of his teaching career in the Department of Pure Math- 
ematics and Mathematical Statistics (DPMMS), University of Cambridge, 
and St John’s College, Cambridge, had many painful experiences when good 
(or even brilliant) students, who were interested in the subject of mathemat- 
ics and its applications and who performed well during their first academic 
year, stumbled or nearly failed in the exams. This led to great frustration, 
which was very hard to overcome in subsequent undergraduate years. A con- 
scientious tutor is always sympathetic to such misfortunes, but even pointing 
out a student’s obvious weaknesses (if any) does not always help. For the 
second author, such experiences were as a parent of a Cambridge University 
student rather than as a teacher. 

We therefore felt that a monograph focusing on Cambridge University 
mathematics examination questions would be beneficial for a number of 
students. Given our own research and teaching backgrounds, it was natural 
for us to select probability and statistics as the overall topic. The obvious 
starting point was the first-year course in probability and the second-year 
course in statistics. In order to cover other courses, several further volumes 
will be needed; for better or worse, we have decided to embark on such a 
project. 

Thus our essential aim is to present the Cambridge University probability 
and statistics courses by means of examination (and examination-related) 
questions that have been set over a number of past years. Of course, 
Cambridge University examinations have never been easy. On the basis 
of examination results, candidates are divided into classes: first, second 
(divided into two categories: 2.1 and 2.2) and third; a small number of 
candidates fail. (In fact, a more detailed list ranking all the candidates 
in order is produced, but not publicly disclosed.) The examinations are 
officially called the ‘Mathematical Tripos’, after the three-legged stools on
which candidates and examiners used to sit (sometimes for hours) during 
oral examinations in ancient times. Nowadays all examinations are written. 
The first year of the three-year undergraduate course is called Part IA, the 
second Part IB and the third Part II. 

For example, in May-June of 2003 the first-year mathematics students sat
four examination papers; each lasted three hours and included 12 questions 
from two subjects. The following courses were examined: algebra and geom- 
etry, numbers and sets, analysis, probability, differential equations, vector 
calculus, and dynamics. All questions on a given course were put in a single
paper, except for algebra and geometry, which appeared in two papers. In
each paper, four questions were classified as short (two from each of the two 
courses selected for the paper) and eight as long (four from each selected 
course). A candidate might attempt all four short questions and at most 
five long questions, no more than three on each course; a long question car- 
ries twice the credit of a short one. A calculation shows that if a student 
attempts all nine allowed questions (which is often the case), and the time 
is distributed evenly, a short question must be completed in 12-13 minutes
and a long one in 25-26 minutes. This is not easy and usually requires special
practice; one of the goals of this book is to assist with such a training
programme. 

The pattern of the second-year examinations has similarities but also dif- 
ferences. In June 2003, there were four IB Maths Tripos papers, each three 
hours long and containing nine or ten short and nine or ten long questions 
in as many subjects selected for a given paper. In particular, IB statistics 
was set in Papers 1, 2 and 4, giving a total of six questions. Of course, 
preparing for Part IB examinations is different from preparing for Part IA; 
we comment on some particular points in the corresponding chapters. 

For a typical Cambridge University student, specific preparation for the 
examinations begins in earnest during the Easter (or Summer) Term (be- 
ginning in mid-April). Ideally, the work might start during the preceding 
five-week vacation. (Some of the examination work for Parts IB and II, the 
computational projects, is done mainly during the summer vacation period.) 
As the examinations approach, the atmosphere in Cambridge can become 
rather tense and nervous, although many efforts are made to defuse the
tension. Many candidates expend a great deal of effort in trying to calculate 
exactly how much work to put into each given subject, depending on how 
much examination credit it carries and how strong or weak they feel in it, 
in order to optimise their overall performance. One can agree or disagree 
with this attitude, but one thing seemed clear to us: if the students receive 
(and are able to digest) enough information about and insight into the level
and style of the Tripos questions, they will have a much better chance of 
performing to the best of their abilities. At present, owing to great pressures 
on time and energy, most of them are not in a position to do so, and much 
is left to chance. We will be glad if this book helps to change this situation 
by alleviating pre-examination nerves and by stripping Tripos examinations 
of some of their mystery, at least in respect of the subjects treated here. 

Thus, the first reason for this book was a desire to make life easier for the 
students. However, in the course of working on the text, a second motiva- 
tion emerged, which we feel is of considerable professional interest to anyone 
teaching courses in probability and statistics. In 1991-2 there was a major 
change in Cambridge University to the whole approach to probabilistic and 
statistical courses. The most notable aspect of the new approach was that 
the IA Probability course and the IB Statistics course were redesigned to 
appeal to a wide audience (200 first-year students in the case of IA Probability
and nearly the same number of second-year students in the case
of IB Statistics). For a large number of students, these are the only courses 
from the whole of probability and statistics that they attend during their 
undergraduate years. Since more and more graduates in the modern world 
have to deal with theoretical and (especially) applied problems of a proba- 
bilistic or statistical nature, it is important that these courses generate and 
maintain a strong and wide appeal. The main goal shifted, moving from 
an academic introduction to the subject towards a more methodological 
approach which equips students with the tools needed to solve reasonable 
practical and theoretical questions in a ‘real life’ situation. 

Consequently, the emphasis in IA Probability moved further away from 
sigma-algebras, Lebesgue and Stieltjes integration and characteristic func- 
tions to a direct analysis of various models, both discrete and continuous, 
with the aim of preparing students both for future problems and for future 
courses (in particular, Part IB Statistics and Part IB/II Markov chains). In 
turn, in IB Statistics the focus shifted towards the most popular practical 
applications of estimators, hypothesis testing and regression. The principal
determinant of examination performance in both IA Probability and
IB Statistics became students’ ability to choose and analyse the right model 
and accurately perform a reasonable amount of calculation rather than their 
ability to solve theoretical problems. 

Certainly such changes (and parallel developments in other courses) were 
not always unanimously popular among the Cambridge University Faculty 
of Mathematics, and provoked considerable debate at times. However, the 
student community was in general very much in favour of the new approach, 
and the ‘redesigned’ courses gained increased popularity both in terms of
attendance and in terms of attempts at examination questions (which has 
become increasingly important in the life of the Faculty of Mathematics). 
In addition, with the ever-growing prevalence of computers, students have 
shown a strong preference for an ‘algorithmic’ style of lectures and exami- 
nation questions (at least in the authors’ experience). 

In this respect, the following experience by the first author may be of some 
interest. For some time I have questioned former St John’s mathematics 
graduates, who now have careers in a wide variety of different areas, about 
what parts of the Cambridge University course they now consider as most 
important for their present work. It turned out that the strongest impact on 
the majority of respondents is not related to particular facts, theorems, or 
proofs (although jokes by lecturers are well remembered long afterwards). 
Rather they appreciate the ability to construct a mathematical model which 
represents a real-life situation, and to solve it analytically or (more often) 
numerically. It must therefore be acknowledged that the new approach was 
rather timely. As a consequence of all this, the level and style of Maths Tripos 
questions underwent changes. It is strongly suggested (although perhaps it 
was not always achieved) that the questions should have a clear structure 
where candidates are led from one part to another. 

The second reason described above gives us hope that the book will be 
interesting for an audience outside Cambridge. In this regard, there is a 
natural question: what is the book’s place in the (long) list of textbooks on 
probability and statistics? Many of the references in the bibliography are 
books published in English after 1991, containing the terms ‘probability’ or 
‘statistics’ in their titles and available at the Cambridge University Main 
and Departmental Libraries (we are sure that our list is not complete and 
apologise for any omission). 

As far as basic probability is concerned, we would like to compare this 
book with three popular series of texts and problem books, one by S. 
Ross [120]-[125], another by D. Stirzaker [138]-[141] and the third by G. 
Grimmett and D. Stirzaker [62]-[64]. The books by Ross and Stirzaker are
commonly considered as a good introduction to the basics of the subject. In 
fact, the style and level of exposition followed by Ross has been adopted in 
many American universities. On the other hand, Grimmett and Stirzaker’s 
approach is at a much higher level and might be described as ‘professional’. 
The level of our book is intended to be somewhere in-between. In our view, 
it is closer to that of Ross or Stirzaker, but quite far away from them 
in several important aspects. It is our feeling that the level adopted by 
Ross or Stirzaker is not sufficient to get through Cambridge University 
Mathematical Tripos examinations with Class 2.1 or above. Grimmett and
Stirzaker’s books are of course more than enough — but in using them to 
prepare for an examination the main problem would be to select the right 
examples from among a thousand on offer. 


On the other hand, the above monographs, as well as many of the books 
from the bibliography, may be considered as good complementary reading for 
those who want to take further steps in a particular direction. We mention 
here just a few of them: [29], [41], [56], [60], [75], [131] and [26]. In any 
case, the (nostalgic) time when everyone learning probability had to read 
assiduously through the (excellent) two-volume Feller monograph [50] has
long passed (though in our view, Feller has not so far been surpassed). 


In statistics, the picture is more complex. Even the definition of the sub- 
ject of statistics is still somewhat controversial (see Section 3.1). The style 
of lecturing and examining the basic statistics course (and other statistics- 
related courses) at Cambridge University was always rather special. This 
style resisted a trend of making the exposition ‘fully rigorous’, despite the 
fact that the course is taught to mathematics students. A minority of stu- 
dents found it difficult to follow, but for most of them this was never an 
issue. On the other hand, the level of rigour in the course is quite high and 
requires substantial mathematical knowledge. Among modern books, the 
closest to the Cambridge University style is perhaps [22]. As an example of 
a very different approach, we can point to [153] (whose style we personally 
admire very much but would not consider as appropriate for first reading or 
for preparing for Cambridge examinations). 


A particular feature of this book is that it contains repetitions: certain 
topics and questions appear more than once, often in slightly different form, 
which makes it difficult to refer to previous occurrences. This is of course 
a pattern of the examination process which becomes apparent when one 
considers it over a decade or so. Our personal attitudes here followed a 
proverb ‘Repetition is the mother of learning’, popular (in various forms) 
in several languages. However, we apologise to those readers who may find 
some (and possibly many) of these repetitions excessive. 


This book is organised as follows. In the first two chapters we present 
the material of the IA Probability course (which consists of 24 one-hour
lectures). In this part the Tripos questions are placed within or immediately 
following the corresponding parts of the expository text. In Chapters 3 and 
4 we present the material from the 16-lecture IB Statistics course. Here, 
the Tripos questions tend to embrace a wider range of single topics, and 
we decided to keep them separate from the course material. However, the 
various pieces of theory are always presented with a view to the rôle they
play in examination questions. 

A special word should be said about solutions in this book. In part, we 
use students’ solutions or our own solutions (in a few cases solutions are 
reduced to short answers or hints). However, a number of the so-called ex- 
aminers’ model solutions have also been used; these were originally set by 
the corresponding examiners and often altered by relevant lecturers and co- 
examiners. (A curious observation by many examiners is that, regardless of 
how perfect their model solutions are, it is rare that any of the candidates 
follow them.) Here, we aimed to present all solutions in a unified style; we 
also tried to correct mistakes occurring in these solutions. We should pay 
the highest credit to all past and present members of the DPMMS who con- 
tributed to the painstaking process of supplying model solutions to Tripos 
problems in IA Probability and IB Statistics: in our view their efforts defi- 
nitely deserve the deepest appreciation, and this book should be considered 
as a tribute to their individual and collective work. 

On the other hand, our experience shows that, curiously, students very 
rarely follow the ideas of model solutions proposed by lecturers, supervisors 
and examiners, however impeccable and elegant these solutions may be. 
Furthermore, students understand each other much more quickly than they 
understand their mentors. For that reason we tried to preserve whenever 
possible the style of students’ solutions throughout the whole book. 

Informal digressions scattered across the text have been in part borrowed 
from [38], [60], [68], the St Andrew’s University website www-history.mcs.st- 
andrews.ac.uk/history/ and the University of Massachusetts website 
www.umass.edu/wsp/statistics/tales/. Conversations with H. Daniels, D.G. 
Kendall and C.R. Rao also provided a few subjects. However, a num- 
ber of stories are just part of folklore (most of them are accessible 
through the Internet); any mistakes are our own responsibility. Pho- 
tographs and portraits of many of the characters mentioned in this book 
are available on the University of York website www.york.ac.uk/depts/ 
maths/histstat/people/ and (with biographies) on http://members.aol.com/ 
jayKplanr/images.htm. 

The advent of the World Wide Web also had another visible impact: a pro- 
liferation of humour. We confess that much of the time we enjoyed browsing 
(quite numerous) websites advertising jokes and amusing quotations; conse- 
quently we decided to use some of them in this book. We apologise to the 
authors of these jokes for not quoting them (and sometimes changing the 
sense of sentences). 

Throughout the process of working on this book we have felt both the 
support and the criticism (sometimes quite sharp) of numerous members 
of the Faculty of Mathematics and colleagues from outside Cambridge
who read some or all of the text or learned about its existence. We would 
like to thank all these individuals and bodies, regardless of whether they 
supported or rejected this project. We thank personally Charles Goldie, 
Oliver Johnson, James Martin, Richard Samworth and Amanda Turner, 
for stimulating discussions and remarks. We are particularly grateful to 
Alan Hawkes for the limitless patience with which he went through the 
preliminary version of the manuscript. As stated above, we made wide 
use of lecture notes, example sheets and other related texts prepared by 
present and former members of the Statistical Laboratory, Department of 
Pure Mathematics and Mathematical Statistics, University of Cambridge, 
and Mathematics Department and Statistics Group, EBMS, University of 
Wales-Swansea. In particular, a large number of problems were collected 
by David Kendall and put to great use in Example Sheets by Frank 
Kelly. We benefitted from reading excellent lecture notes produced by 
Richard Weber and Susan Pitts. Damon Wischik kindly provided vari- 
ous tables of probability distributions. Statistical tables are courtesy of 
R. Weber. 

Finally, special thanks go to Sarah Shea-Simonds and Maureen Storey for 
carefully reading through parts of the book and correcting a great number 
of stylistic errors. 


Preface to the second edition 


The second edition differs from the first edition of this book in about 15 per- 
cent of problems and examples. The theoretical part was kept intact al- 
though some portions of the material in Sections 2.1, 2.3, 4.8 and 4.9 were 
added, removed or changed. As in the first edition, the staple of the book 
is University of Cambridge examination questions; the above changes have 
been caused in part by trends in the exam practice. 

There are also some changes in the organisation of the presentation in this 
edition. Remarks, Examples and Worked Examples are numbered in one se- 
quence by section. Some sections end with Problems for which no solution 
is provided. Equations are numbered by section in a separate sequence. Ex- 
amination questions in statistics are given in the final chapter. 

The final stage of the work on the second edition was done when one of 
the authors (YS) was visiting the University of Sao Paulo. The financial 
support of the FAPESP Foundation is acknowledged. 


PART ONE 


BASIC PROBABILITY 


1 


Discrete outcomes 


1.1 A uniform distribution 


Lest men suspect your tale untrue, 
Keep probability in view. 
J. Gay (1685-1732), English poet. 


In this section we use the simplest (and historically the earliest) probabilis- 
tic model where there are a finite number m of possibilities (often called 
outcomes) and each of them has the same probability 1/m. A collection A 
of $k$ outcomes with $k \le m$ is called an event and its probability $P(A)$ is
calculated as $k/m$:

$$P(A) = \frac{\text{the number of outcomes in } A}{\text{the total number of outcomes}}. \tag{1.1.1}$$

An empty collection has probability zero and the whole collection one. This
scheme looks deceptively simple: in reality, calculating the number of outcomes
in a given event (or indeed, the total number of outcomes) may be
tricky.


Worked Example 1.1.1 You and I play a coin-tossing game: if the coin 
falls heads I score one, if tails you score one. In the beginning, the score is 
zero. (i) What is the probability that after 2n throws our scores are equal? 
(ii) What is the probability that after 2n + 1 throws my score is three more 
than yours? 


Solution The outcomes in (i) are all sequences

$$HHH\ldots H,\; THH\ldots H,\; \ldots,\; TTT\ldots T$$

formed by $2n$ subsequent letters $H$ or $T$ (or, 0 and 1). The total number
of outcomes is $m = 2^{2n}$, each carries probability $1/2^{2n}$. We are looking for
outcomes where the number of $H$s equals that of $T$s. The number $k$ of such
outcomes is $(2n)!/(n!\,n!)$ (the number of ways to choose positions for $n$ $H$s
among $2n$ places available in the sequence). The probability in question is

$$\frac{(2n)!}{n!\,n!} \cdot \frac{1}{2^{2n}}.$$

In (ii), the outcomes are the sequences of length $2n + 1$, $2^{2n+1}$ in total.
The probability equals

$$\frac{(2n+1)!}{(n+2)!\,(n-1)!} \cdot \frac{1}{2^{2n+1}}.$$
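Both answers can be checked by direct enumeration for small $n$. The following Python sketch is only an illustration (the function names are ours and are not part of the original text): it lists all coin sequences, counts the favourable ones and compares the result with the closed-form expressions above.

```python
from fractions import Fraction
from itertools import product
from math import comb

def prob_equal_scores(n):
    """Exact probability that the scores are equal after 2n throws (by enumeration)."""
    seqs = list(product("HT", repeat=2 * n))
    ties = sum(1 for s in seqs if s.count("H") == n)
    return Fraction(ties, len(seqs))

def prob_lead_by_three(n):
    """Exact probability that heads lead tails by three after 2n + 1 throws."""
    seqs = list(product("HT", repeat=2 * n + 1))
    good = sum(1 for s in seqs if s.count("H") - s.count("T") == 3)
    return Fraction(good, len(seqs))

# Compare with (2n)!/(n! n!) / 2^(2n) and (2n+1)!/((n+2)! (n-1)!) / 2^(2n+1).
for n in range(1, 6):
    assert prob_equal_scores(n) == Fraction(comb(2 * n, n), 2 ** (2 * n))
    assert prob_lead_by_three(n) == Fraction(comb(2 * n + 1, n + 2), 2 ** (2 * n + 1))
```

Exact fractions are used rather than floating-point numbers so that the comparison with the formulas is an identity, not an approximation.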


Worked Example 1.1.2 A tennis tournament is organised for $2^n$ players
on a knock-out basis, with n rounds, the last round being the final. Two 
players are chosen at random. Calculate the probability that they meet: 
(i) in the first or second round, (ii) in the final or semi-final, and (iii) the 
probability they do not meet. 


Solution The sentence ‘Two players are chosen at random’ is crucial. For
instance, one may think that the choice has been made after the tournament
when all results are known. Then there are $2^{n-1}$ pairs of players meeting in
the first round, $2^{n-2}$ in the second round, two in the semi-final, one in the
final and $2^{n-1} + 2^{n-2} + \cdots + 2 + 1 = 2^n - 1$ in all rounds.

The total number of player pairs is $\binom{2^n}{2} = 2^{n-1}(2^n - 1)$. Hence the
answers:

$$\text{(i)}\ \frac{2^{n-1} + 2^{n-2}}{2^{n-1}(2^n - 1)} = \frac{3}{2(2^n - 1)}, \qquad \text{(ii)}\ \frac{3}{2^{n-1}(2^n - 1)},$$

and

$$\text{(iii)}\ \frac{2^{n-1}(2^n - 1) - (2^n - 1)}{2^{n-1}(2^n - 1)} = 1 - \frac{1}{2^{n-1}}.$$
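The counting argument is easy to reproduce numerically. The sketch below assumes, as in the solution, that the pair is chosen uniformly among all pairs of players once the tournament is over; the function name `meeting_probabilities` is ours and purely illustrative, and the code requires $n \ge 2$.

```python
from fractions import Fraction
from math import comb

def meeting_probabilities(n):
    """Return (first or second round, semi-final or final, never meet) for 2**n players."""
    total_pairs = comb(2 ** n, 2)
    # Round r (r = 1, ..., n) consists of 2**(n - r) matches, i.e. that many meeting pairs.
    matches = [2 ** (n - r) for r in range(1, n + 1)]
    early = Fraction(matches[0] + matches[1], total_pairs)
    late = Fraction(matches[-1] + matches[-2], total_pairs)
    never = 1 - Fraction(sum(matches), total_pairs)
    return early, late, never

# Compare with the closed-form answers (i)-(iii).
for n in range(2, 11):
    early, late, never = meeting_probabilities(n)
    assert early == Fraction(3, 2 * (2 ** n - 1))
    assert late == Fraction(3, 2 ** (n - 1) * (2 ** n - 1))
    assert never == 1 - Fraction(1, 2 ** (n - 1))
```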


Worked Example 1.1.3 There are n people gathered in a room. 


(i) What is the probability that two (at least) have the same birthday? 
Calculate the probability for n = 22 and 23. 

(ii) What is the probability that at least one has the same birthday as you? 
What value of n makes it close to 1/2? 
Solution The total number of outcomes is $365^n$. In (i), the number of outcomes
not in the event is $365 \times 364 \times \cdots \times (365 - n + 1)$. So, the probability
that all birthdays are distinct is $\big(365 \times 364 \times \cdots \times (365 - n + 1)\big)/365^n$ and
that two or more people have the same birthday

$$1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}.$$

For $n = 22$:

$$1 - \frac{365}{365} \times \frac{364}{365} \times \cdots \times \frac{344}{365} \approx 0.4757,$$

and for $n = 23$:

$$1 - \frac{365}{365} \times \frac{364}{365} \times \cdots \times \frac{343}{365} \approx 0.5073.$$

In (ii), the number of outcomes not in the event is $364^n$ and the probability
in question $1 - (364/365)^n$. We want it to be near 1/2, so

$$\left(\frac{364}{365}\right)^n \approx \frac{1}{2}, \quad \text{i.e.} \quad n \approx -\frac{1}{\log_2 (364/365)} \approx 252.65.$$
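The numerical values are easily reproduced. The following Python sketch is an illustration under the same assumptions as the solution (365 equally likely birthdays, no leap years); the function names and the parameter `days` are ours.

```python
from math import log

def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    all_distinct = 1.0
    for i in range(n):
        all_distinct *= (days - i) / days
    return 1.0 - all_distinct

def prob_match_with_you(n, days=365):
    """Probability that at least one of n people has the same birthday as you."""
    return 1.0 - ((days - 1) / days) ** n

print(round(prob_shared_birthday(22), 4))  # 0.4757
print(round(prob_shared_birthday(23), 4))  # 0.5073
print(log(2) / log(365 / 364))             # value of n solving (364/365)^n = 1/2, about 252.65
```

In part (ii) the probability therefore first exceeds 1/2 at $n = 253$, in contrast with $n = 23$ in part (i).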


Worked Example 1.1.4 Mary tosses n+1 coins and John tosses n coins. 
What is the probability that Mary gets more heads than John? 


Solution We must assume that all coins are unbiased (as it was not specified
otherwise). Mary has $2^{n+1}$ outcomes (all possible sequences of heads
and tails) and John $2^n$; jointly $2^{2n+1}$ outcomes that are equally likely. Let
$H_M$ and $T_M$ be the number of Mary’s heads and tails and $H_J$ and $T_J$
John’s, then $H_M + T_M = n + 1$ and $H_J + T_J = n$. The events $\{H_M > H_J\}$
and $\{T_M > T_J\}$ have the same number of outcomes, thus $P(H_M > H_J) =
P(T_M > T_J)$.

On the other hand, $H_M > H_J$ if and only if $n - H_M < n - H_J$, i.e.
$T_M - 1 < T_J$ or $T_M \le T_J$. So event $H_M > H_J$ is the same as $T_M \le T_J$, and
$P(T_M \le T_J) = P(H_M > H_J)$.

But for any (joint) outcome, either $T_M > T_J$ or $T_M \le T_J$, i.e. the number
of outcomes in $\{T_M > T_J\}$ equals $2^{2n+1}$ minus that in $\{T_M \le T_J\}$. Therefore,
$P(T_M > T_J) = 1 - P(T_M \le T_J)$. To summarise:

$$P(H_M > H_J) = P(T_M > T_J) = 1 - P(T_M \le T_J) = 1 - P(H_M > H_J),$$

whence $P(H_M > H_J) = 1/2$.


Solution Suppose that the final toss belongs to Mary. Let $x$ be the probability
that Mary’s number of heads equals John’s number of heads just
before the final toss. By symmetry, the probability that Mary’s number
of heads exceeds that of John just before the final toss is $(1 - x)/2$. This
implies that the probability that Mary’s number of heads exceeds that of John
by the end of the game equals $(1 - x)/2 + x/2 = 1/2$.


Solution By the end of the game Mary has either more heads or more tails 
than John because she has more tosses. These two cases exclude each other. 
Hence, the probability of each case is 1/2 by the symmetry argument. 
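The answer 1/2 does not depend on $n$, which can be confirmed by an exact computation. The sketch below assumes fair coins, as in the solutions above; the function name is ours. It sums the number of joint outcomes with $H_M > H_J$, using the fact that there are $\binom{n+1}{h}$ sequences in which Mary gets $h$ heads and $\binom{n}{h'}$ sequences in which John gets $h'$ heads.

```python
from fractions import Fraction
from math import comb

def prob_mary_more_heads(n):
    """Exact P(H_M > H_J) when Mary tosses n + 1 fair coins and John tosses n."""
    favourable = sum(
        comb(n + 1, hm) * comb(n, hj)
        for hm in range(n + 2)
        for hj in range(n + 1)
        if hm > hj
    )
    # All 2^(2n+1) joint outcomes are equally likely.
    return Fraction(favourable, 2 ** (2 * n + 1))

assert all(prob_mary_more_heads(n) == Fraction(1, 2) for n in range(12))
```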


Worked Example 1.1.5 You throw 6n six-sided dice at random. Show 
that the probability that each number appears exactly n times is 


$$\frac{(6n)!}{(n!)^6} \left(\frac{1}{6}\right)^{6n}.$$

Solution There are $6^{6n}$ outcomes in total (six for each die), each has
probability $1/6^{6n}$. We want $n$ dice to show one dot, $n$ two, and so forth. The
number of such outcomes is counted by fixing first which dice show one:
$(6n)!/[n!(5n)!]$. Given $n$ dice showing one, we fix which remaining dice show
two: $(5n)!/[n!(4n)!]$, etc. The total number of desired outcomes is the product
that equals $(6n)!/(n!)^6$. This gives the answer.
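For $n = 1$ the formula can be compared with a complete enumeration of the $6^6 = 46656$ outcomes. The Python sketch below is only a check of this smallest case; the function name is ours.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def prob_each_face_n_times(n):
    """The formula (6n)! / (n!)^6 * (1/6)^(6n) as an exact fraction."""
    return Fraction(factorial(6 * n), factorial(n) ** 6 * 6 ** (6 * n))

# Enumerate all throws of six dice and count those showing every face exactly once.
hits = sum(1 for throw in product(range(1, 7), repeat=6)
           if sorted(throw) == [1, 2, 3, 4, 5, 6])
assert Fraction(hits, 6 ** 6) == prob_each_face_n_times(1)
print(prob_each_face_n_times(1))  # 5/324
```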


In many problems, it is crucial to be able to spot recursive equations 
relating the cardinality of various events. For example, for the number $f_n$
of ways of tossing a coin $n$ times so that successive tails never appear:
$f_n = f_{n-1} + f_{n-2}$, $n \ge 3$ (a Fibonacci equation).


Worked Example 1.1.6 (i) Determine the number $g_n$ of ways of tossing
a coin $n$ times so that the combination $HT$ never appears. (ii) Show that
$f_n = f_{n-1} + f_{n-2} + f_{n-3}$, $n > 3$, is the equation for the number of ways of
tossing a coin $n$ times so that three successive heads never appear.


Solution (i) $g_n = 1 + n$: 1 for the sequence $HH \ldots H$, $n$ for the sequences
$T \ldots TH \ldots H$ (which includes $T \ldots T$).

(ii) The outcomes are $2^n$ sequences $(y_1, \ldots, y_n)$ of $H$ and $T$. Let $A_n$ be
the event {no three successive heads appeared after $n$ tosses}, then $f_n$ is
the cardinality $\# A_n$. Split: $A_n = B_n^{(1)} \cup B_n^{(2)} \cup B_n^{(3)}$, where $B_n^{(1)}$ is the event
{no three successive heads appeared after $n$ tosses, and the last toss was a
tail}, $B_n^{(2)}$ = {no three successive heads appeared after $n$ tosses, and the last
two tosses were $TH$} and $B_n^{(3)}$ = {no three successive heads appeared after
$n$ tosses, and the last three tosses were $THH$}.

Clearly, $B_n^{(i)} \cap B_n^{(j)} = \emptyset$, $1 \le i \ne j \le 3$, and so $f_n = \# B_n^{(1)} + \# B_n^{(2)} + \# B_n^{(3)}$.
Now drop the last digit $y_n$: $(y_1, \ldots, y_n) \in B_n^{(1)}$ if and only if $y_n = T$,
$(y_1, \ldots, y_{n-1}) \in A_{n-1}$, i.e. $\# B_n^{(1)} = f_{n-1}$. Also, $(y_1, \ldots, y_n) \in B_n^{(2)}$ if and only
if $y_{n-1} = T$, $y_n = H$, and $(y_1, \ldots, y_{n-2}) \in A_{n-2}$. This allows us to drop
the two last digits, yielding $\# B_n^{(2)} = f_{n-2}$. Similarly, $\# B_n^{(3)} = f_{n-3}$. The
equation then follows.
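Recursions of this kind are easily tested by brute force before one attempts a proof. The sketch below (in Python; the helper `count_avoiding` is our own illustrative name) counts, for small $n$, the sequences of $H$ and $T$ in which a given pattern never appears, and checks the Fibonacci equation, the answer $g_n = n + 1$ and the three-term equation of part (ii).

```python
from itertools import product

def count_avoiding(n, pattern):
    """Count H/T sequences of length n in which `pattern` never appears."""
    return sum(1 for s in product("HT", repeat=n) if pattern not in "".join(s))

# No two successive tails: f_n = f_{n-1} + f_{n-2}.
f = [count_avoiding(n, "TT") for n in range(13)]
assert all(f[n] == f[n - 1] + f[n - 2] for n in range(3, 13))

# Part (i): the combination HT never appears, so g_n = n + 1.
assert all(count_avoiding(n, "HT") == n + 1 for n in range(1, 13))

# Part (ii): no three successive heads, f_n = f_{n-1} + f_{n-2} + f_{n-3} for n > 3.
t = [count_avoiding(n, "HHH") for n in range(13)]
assert all(t[n] == t[n - 1] + t[n - 2] + t[n - 3] for n in range(4, 13))
```

The enumeration visits all $2^n$ sequences, so it is practical only for small $n$; the recursion itself gives the numbers for any $n$ in linear time.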


Worked Example 1.1.7 In a Cambridge cinema n people sit at random 
in the first row. The row has N > n seats. Find the probability of the 
following events: 


(a) that no two people sit next to each other; 

(b) that each person has exactly one neighbour; and 

(c) that, for every pair of distinct seats symmetric relative to the middle of 
the row, at least one seat from the pair is vacant. 


Now assume that n people sit at random in the two first rows of the same 
cinema, with 2N > n. Find the probability of the following events: 


(d) that at least in one row no two people sit next to each other; 

(e) that in the first row no two people sit next to each other and in the 
second row each person has exactly one neighbour; and 

(f) that, for every pair of distinct seats in the second row, symmetric relative 
to the middle of the row, at least one seat from the pair is vacant. 


In parts (d)-(f) you may find it helpful to use indicator functions specifying 
limits of summation. 


Solution We assume that $n \ge 1$. In parts (a)-(c), the total number of
outcomes equals $\binom{N}{n}$ and all of them have the same probability. (All people
are indistinguishable.) Then:

(a) The answer is $\binom{N-n+1}{n} \big/ \binom{N}{n}$ if $N \ge 2n - 1$ and 0 if $N < 2n - 1$. In
fact, to place $n$ people in $N$ seats so that no two of them sit next to each
other, we scan the row from left to right (say) and affiliate, with each of
$n$ seats taken, an empty seat positioned to the right. Place an extra empty
seat to the right of the person in the position to the right end of the row.
This leaves $N - n + 1$ virtual positions where we should place $n$ objects. The
objects are empty seats to the right of occupied ones.

(b) First, assume $n = 2l$ is even. Then we have $n/2 = l$ pairs of neighbouring
occupied seats, and with each of $n/2$ of them we again affiliate an
empty seat to the right. Thus the answer is $\binom{N-n+1}{n/2} \big/ \binom{N}{n}$ if $N \ge 3n/2 - 1$
and 0 if $N < 3n/2 - 1$.

If $n$ is odd, the probability in question equals 0.
(c) First, assume that $N$ is even. Then, if we require that all people sit in
the left-hand side of the row, the number of outcomes is $\binom{N/2}{n}$. Now, for each
person we have 2 symmetric choices. Hence, the answer is $2^n \binom{N/2}{n} \big/ \binom{N}{n}$ if
$N \ge 2n$ and 0 if $N < 2n$.

When $N$ is odd, there is a seat in the middle: it can be taken or vacant.
Thus, the answer in this case is:

$$\begin{cases}
\left[ 2^n \binom{(N-1)/2}{n} + 2^{n-1} \binom{(N-1)/2}{n-1} \right] \Big/ \binom{N}{n} & \text{if } N \ge 2n + 1, \\
2^{n-1} \binom{(N-1)/2}{n-1} \Big/ \binom{N}{n} & \text{if } N = 2n - 1, \\
0 & \text{if } N < 2n - 1.
\end{cases}$$


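In parts (d)-(f), we have in total

$$\binom{2N}{n} = \sum_{0 \le k \le n} \mathbf{1}(n - N \le k \le N) \binom{N}{k} \binom{N}{n-k}$$

outcomes, again of equal probability.

(d) The answer is therefore

$$\left[ 2 \sum_{0 \le k \le n} \binom{N-k+1}{k} \binom{N}{n-k}\, \mathbf{1}\big(n - N \le k \le (N+1)/2\big) - \sum_{0 \le k \le n} \binom{N-k+1}{k} \binom{N-n+k+1}{n-k}\, \mathbf{1}\big(n - (N+1)/2 \le k \le (N+1)/2\big) \right] \Big/ \binom{2N}{n}.$$

(e) The answer is

$$\sum_{0 \le l \le n/2} \binom{N-2l+1}{l} \binom{N-n+2l+1}{n-2l}\, \mathbf{1}\big(n/2 - (N+1)/4 \le l \le (N+1)/3\big) \Big/ \binom{2N}{n}.$$

(f) For $N$ even, the answer is

$$\sum_{0 \le k \le n} 2^k \binom{N/2}{k} \binom{N}{n-k}\, \mathbf{1}(n - N \le k \le N/2) \Big/ \binom{2N}{n}.$$

For $N$ odd, the answer is

$$\sum_{0 \le k \le n} \left[ \mathbf{1}\big(k \le (N-1)/2\big)\, 2^k \binom{(N-1)/2}{k} + \mathbf{1}\big(1 \le k \le (N+1)/2\big)\, 2^{k-1} \binom{(N-1)/2}{k-1} \right] \binom{N}{n-k}\, \mathbf{1}(n - k \le N) \Big/ \binom{2N}{n}.$$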
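Counting arguments of this type are error-prone, and for small $N$ it is worth comparing them with an exhaustive search. The Python sketch below covers parts (a)-(c) only, under the assumption used in the solution that every set of $n$ occupied seats is equally likely; the function names are ours. It returns the numbers of favourable seatings, which are to be divided by $\binom{N}{n}$ to obtain the probabilities.

```python
from itertools import combinations
from math import comb

def brute_counts(N, n):
    """Count seatings of n people in a row of N seats satisfying (a), (b), (c)."""
    a = b = c = 0
    for seats in combinations(range(1, N + 1), n):
        taken = set(seats)
        # Number of occupied neighbouring seats for each person.
        nbrs = [(s - 1 in taken) + (s + 1 in taken) for s in seats]
        a += all(k == 0 for k in nbrs)
        b += all(k == 1 for k in nbrs)
        # Seats s and N + 1 - s are symmetric; the middle seat (N odd) is unpaired.
        c += all(not (s in taken and N + 1 - s in taken) for s in range(1, N // 2 + 1))
    return a, b, c

def formula_counts(N, n):
    """Numerators of the answers to (a)-(c); divide by comb(N, n) for probabilities."""
    a = comb(N - n + 1, n)
    b = comb(N - n + 1, n // 2) if n % 2 == 0 else 0
    if N % 2 == 0:
        c = 2 ** n * comb(N // 2, n)
    else:
        half = (N - 1) // 2
        c = 2 ** n * comb(half, n) + 2 ** (n - 1) * comb(half, n - 1)
    return a, b, c

for N in range(1, 10):
    for n in range(1, N + 1):
        assert brute_counts(N, n) == formula_counts(N, n)
```

Note that `math.comb(m, k)` returns 0 when $k > m$, which reproduces the cases where the answer is 0 without separate conditions.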
1.2 Conditional Probabilities. Bayes’ Theorem. Independent trials


Probability theory is nothing but common sense reduced to calculation. 
P.-S. Laplace (1749-1827), French mathematician. 


Clockwork Omega 
(From the series ‘Movies that never made it to the Big Screen’.) 


From now on we adopt a more general setting: our outcomes do not necessarily
have equal probabilities $p_1, \ldots, p_m$, with $p_i > 0$ and $p_1 + \cdots + p_m = 1$.

As before, an event $A$ is a collection of outcomes (possibly empty); the
probability $P(A)$ of event $A$ is now given by

$$P(A) = \sum_{\text{outcome } i \in A} p_i = \sum_{\text{outcome } i} p_i\, I(i \in A). \tag{1.2.1}$$

($P(A) = 0$ for $A = \emptyset$.) Here and below, $I$ stands for the indicator function,
viz.:

$$I(i \in A) = \begin{cases} 1, & \text{if } i \in A, \\ 0, & \text{otherwise.} \end{cases}$$


The probability of the total set of outcomes is 1. The total set of outcomes
is also called the whole, or full, event and is often denoted by $\Omega$, so $P(\Omega) = 1$.
An outcome is often denoted by $\omega$, and if $p(\omega)$ is its probability, then

$$P(A) = \sum_{\omega \in A} p(\omega) = \sum_{\omega \in \Omega} p(\omega)\, I(\omega \in A). \tag{1.2.2}$$


As follows from this definition, the probability of the union

$$P(A_1 \cup A_2) = P(A_1) + P(A_2) \tag{1.2.3}$$

for any pair of disjoint events $A_1$, $A_2$ (with $A_1 \cap A_2 = \emptyset$). More generally,

$$P(A_1 \cup \cdots \cup A_n) = P(A_1) + \cdots + P(A_n) \tag{1.2.4}$$

for any collection of pair-wise disjoint events (with $A_j \cap A_{j'} = \emptyset$ for all
$j \ne j'$). Consequently, (i) the probability $P(A^c)$ of the complement $A^c = \Omega \setminus A$
is $1 - P(A)$, (ii) if $B \subseteq A$, then $P(B) \le P(A)$ and $P(A) - P(B) = P(A \setminus B)$
and (iii) for a general pair of events $A$, $B$: $P(A \setminus B) = P(A \setminus (A \cap B)) =
P(A) - P(A \cap B)$.
Furthermore, for a general (not necessarily disjoint) union:

$$P(A_1 \cup \cdots \cup A_n) \le \sum_{i=1}^{n} P(A_i);$$

a more detailed analysis of the probability $P(\bigcup A_i)$ is provided by the
exclusion-inclusion formula (1.3.1) below.

Given two events A and B with P(B) > 0, the conditional probability 
P(A|B) of A given B is defined as the ratio 


$$P(A|B) = \frac{P(A\cap B)}{P(B)}. \qquad (1.2.5)$$

At this stage, the conditional probabilities are important for us because of 
two formulas. One is the formula of complete probability: if $B_1,\dots,B_n$ are
pair-wise disjoint events partitioning the whole event $\Omega$, i.e. have $B_i\cap B_j = \emptyset$
for $1\le i<j\le n$ and $B_1\cup B_2\cup\dots\cup B_n = \Omega$, and in addition $P(B_i) > 0$
for $1\le i\le n$, then

$$P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + \dots + P(A|B_n)P(B_n). \qquad (1.2.6)$$


The proof is straightforward: 


$$P(A) = \sum_{1\le i\le n} P(A\cap B_i) = \sum_{1\le i\le n} \frac{P(A\cap B_i)}{P(B_i)}\,P(B_i) = \sum_{1\le i\le n} P(A|B_i)P(B_i).$$


The point is that often it is conditional probabilities that are given, and 
we are required to find unconditional ones; also, the formula of complete 
probability is useful to clarify the nature of (unconditional) probability P(A). 
Despite its simple character, this formula is an extremely powerful tool in 
literally all areas dealing with probabilities. In particular, a large portion of 
the theory of Markov chains is based on its skilful application. 

Representing $P(A)$ in the form of the right-hand side (RHS) of (1.2.6) is
called conditioning (on the collection of events $B_1,\dots,B_n$).

Another formula is the Bayes formula (or Bayes’ Theorem) named after 
T. Bayes (1702-1761), an English mathematician and cleric. It states that 
under the same assumptions as above, if in addition $P(A) > 0$, then the
conditional probability $P(B_i|A)$ can be expressed in terms of probabilities
$P(B_1),\dots,P(B_n)$ and conditional probabilities $P(A|B_1),\dots,P(A|B_n)$ as

$$P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_{1\le j\le n} P(A|B_j)P(B_j)}. \qquad (1.2.7)$$
The proof is the direct application of the definition and the formula of
complete probability:

$$P(B_i|A) = \frac{P(A\cap B_i)}{P(A)} = \frac{P(A|B_i)P(B_i)}{P(A)}$$

and

$$P(A) = \sum_{j} P(A|B_j)P(B_j).$$


A standard interpretation of equation (1.2.7) is that it relates the posterior
probability $P(B_i|A)$ (conditional on $A$) with prior probabilities $\{P(B_i)\}$ (valid
before one knew that event $A$ occurred).
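Formulas (1.2.6) and (1.2.7) translate directly into a few lines of code. The following is a minimal sketch in Python, using exact rational arithmetic; the function names and the numerical values of the priors and conditional probabilities are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """Formula (1.2.6): P(A) = sum_i P(A|B_i) P(B_i)."""
    return sum(lik * pri for lik, pri in zip(likelihoods, priors))


def posterior(priors, likelihoods):
    """Formula (1.2.7): the list of posterior probabilities P(B_i|A)."""
    p_a = total_probability(priors, likelihoods)
    return [lik * pri / p_a for lik, pri in zip(likelihoods, priors)]


# Hypothetical numbers, chosen only for illustration (not from the text).
priors = [Fraction(1, 4), Fraction(3, 4)]        # P(B_1), P(B_2)
likelihoods = [Fraction(1, 2), Fraction(1, 6)]   # P(A|B_1), P(A|B_2)

print(total_probability(priors, likelihoods))
print(posterior(priors, likelihoods))
```

The priors must sum to 1 and the events $B_i$ must partition $\Omega$; the sketch does not check this.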

In his lifetime, Bayes finished only two papers: one in theology and one 
called ‘Essay towards solving a problem in the doctrine of chances’; the latter 
contained Bayes’ Theorem and was published two years after his death. 
Nevertheless he was elected a Fellow of The Royal Society. Bayes’ theory (of 
which the above theorem is an important part) was for a long time subject 
to controversy. His views were fully accepted (after considerable theoretical 
clarifications) only at the end of the nineteenth century. 


Worked Example 1.2.1 Four mice are chosen (without replacement) 
from a litter containing two white mice. The probability that both white 
mice are chosen is twice the probability that neither is chosen. How many 
mice are there in the litter? 


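Solution Let the number of mice in the litter be $n$. We use the notation
$P(2) = P(\text{two white chosen})$ and $P(0) = P(\text{no white chosen})$. Then

$$P(2) = \binom{n-2}{2}\bigg/\binom{n}{4}.$$

Otherwise, $P(2)$ could be computed by listing the six possible positions of the two white mice among the four draws:

$$\frac{2}{n}\,\frac{1}{n-1} + \frac{2}{n}\,\frac{n-2}{n-1}\,\frac{1}{n-2} + \frac{2}{n}\,\frac{n-2}{n-1}\,\frac{n-3}{n-2}\,\frac{1}{n-3} + \frac{n-2}{n}\,\frac{2}{n-1}\,\frac{1}{n-2}$$
$$+ \frac{n-2}{n}\,\frac{2}{n-1}\,\frac{n-3}{n-2}\,\frac{1}{n-3} + \frac{n-2}{n}\,\frac{n-3}{n-1}\,\frac{2}{n-2}\,\frac{1}{n-3} = \frac{12}{n(n-1)}.$$

On the other hand,

$$P(0) = \binom{n-2}{4}\bigg/\binom{n}{4}.$$

Otherwise, $P(0)$ could be computed as follows:

$$P(0) = \frac{n-2}{n}\,\frac{n-3}{n-1}\,\frac{n-4}{n-2}\,\frac{n-5}{n-3} = \frac{(n-4)(n-5)}{n(n-1)}.$$

Solving the equation

$$\frac{12}{n(n-1)} = 2\,\frac{(n-4)(n-5)}{n(n-1)},$$

we get $n = (9\pm 5)/2$; $n = 2$ is discarded as $n\ge 6$ (otherwise the second
probability is 0). Hence, $n = 7$.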
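The answer can be confirmed by a direct search. The sketch below (Python; the function names and the search range up to 100 are our own choices) evaluates the two binomial expressions exactly and lists every litter size for which $P(2) = 2P(0)$.

```python
from fractions import Fraction
from math import comb


def p_both_white(n):
    """P(2): both white mice are among the four chosen."""
    return Fraction(comb(n - 2, 2), comb(n, 4))


def p_no_white(n):
    """P(0): neither white mouse is among the four chosen."""
    return Fraction(comb(n - 2, 4), comb(n, 4))


# Search litter sizes 4..100 (the upper bound is arbitrary).
solutions = [n for n in range(4, 101) if p_both_white(n) == 2 * p_no_white(n)]
print(solutions)
```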


Worked Example 1.2.2 Lord Vile drinks his whisky randomly, and the
probability that, on a given day, he has $n$ glasses equals $e^{-1}/n!$, $n = 0, 1, \dots$.
Yesterday his wife Lady Vile, his son Liddell and his butler decided to murder 
him. If he had no whisky that day, Lady Vile was to kill him; if he had 
exactly one glass, the task would fall to Liddell, otherwise the butler would 
do it. Lady Vile is twice as likely to poison as to strangle, the butler twice as 
likely to strangle as to poison, and Liddell just as likely to use either method. 
Despite their efforts, Lord Vile is not guaranteed to die from any of their 
attempts, though he is three times as likely to succumb to strangulation as 
to poisoning. 

Today Lord Vile is dead. What is the probability that the butler did it? 


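Solution Write $P(\text{dead}\,|\,\text{strangle}) = 3r$, $P(\text{dead}\,|\,\text{poison}) = r$, and

$$P(\text{drinks no whisky}) = P(\text{drinks one glass}) = \frac{1}{e},\qquad P(\text{drinks two glasses or more}) = 1 - \frac{2}{e}.$$

Next,

$$P(\text{strangle}\,|\,\text{Lady V}) = \frac13,\quad P(\text{poison}\,|\,\text{Lady V}) = \frac23,$$
$$P(\text{strangle}\,|\,\text{butler}) = \frac23,\quad P(\text{poison}\,|\,\text{butler}) = \frac13$$

and

$$P(\text{strangle}\,|\,\text{Liddell}) = P(\text{poison}\,|\,\text{Liddell}) = \frac12.$$

Then the conditional probability $P(\text{butler}\,|\,\text{dead})$ is

$$\frac{P(d|b)P(b)}{P(d|b)P(b) + P(d|LV)P(LV) + P(d|Lddl)P(Lddl)}$$

$$= \frac{\left(1-\frac{2}{e}\right)\left(\frac{3r\times 2}{3}+\frac{r}{3}\right)}{\left(1-\frac{2}{e}\right)\left(\frac{3r\times 2}{3}+\frac{r}{3}\right) + \frac{1}{e}\left(\frac{3r}{3}+\frac{r\times 2}{3}\right) + \frac{1}{e}\left(\frac{3r}{2}+\frac{r}{2}\right)} = \frac{e-2}{e-3/7}\approx 0.3137.$$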
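As a numerical check of this application of Bayes' formula, one can evaluate the posterior directly. In the sketch below (Python) the value of $r$ is arbitrary, since it cancels in the ratio; the dictionary names are ours.

```python
import math

r = 0.1  # P(dead | poison); any positive value works, it cancels in the ratio

# Who acts is decided by the number of glasses, distributed as e^{-1}/n!.
p_culprit = {
    "Lady Vile": math.exp(-1),        # no whisky
    "Liddell": math.exp(-1),          # exactly one glass
    "butler": 1 - 2 * math.exp(-1),   # two glasses or more
}
p_strangle = {"Lady Vile": 1 / 3, "Liddell": 1 / 2, "butler": 2 / 3}


def p_dead_given(culprit):
    """P(dead | culprit), conditioning on the method used."""
    s = p_strangle[culprit]
    return s * 3 * r + (1 - s) * r


p_dead = sum(p_dead_given(c) * p_culprit[c] for c in p_culprit)
p_butler_given_dead = p_dead_given("butler") * p_culprit["butler"] / p_dead
print(round(p_butler_given_dead, 4))
```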


Worked Example 1.2.3 At the station there are three payphones which 
accept 20p pieces. One never works, another always works, while the third 
works with probability 1/2. On my way to the metropolis for the day, I wish
to identify the reliable phone, so that I can use it on my return. The station 
is empty and I have just three 20p pieces. I try one phone and it does not 
work. I try another twice in succession and it works both times. What is the 
probability that this second phone is the reliable one? 


Solution Let $A$ be the event in the question: the first phone tried did not
work and the second worked twice. Clearly,

$$P(A\,|\,\text{1st reliable}) = 0,$$

$$P(A\,|\,\text{2nd reliable}) = P(\text{1st never works}\,|\,\text{2nd reliable}) + \frac12\times P(\text{1st works half-time}\,|\,\text{2nd reliable}) = \frac12 + \frac12\times\frac12 = \frac34,$$

and the probability $P(A\,|\,\text{3rd reliable})$ equals

$$\frac12\times\frac12\times P(\text{2nd works half-time}\,|\,\text{3rd reliable}) = \frac18.$$

The required probability $P(\text{2nd reliable}\,|\,A)$ is then

$$\frac{1/3\times 3/4}{1/3\times(0 + 3/4 + 1/8)} = \frac67.$$
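The same posterior can be obtained by enumerating the six equally likely assignments of phone types to the first, second and third phone. This is a sketch in Python under that equal-likelihood assumption; the names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

WORK_PROB = {"never": Fraction(0), "always": Fraction(1), "half": Fraction(1, 2)}


def likelihood(first, second):
    """P(first phone fails once and second phone works twice | their types)."""
    return (1 - WORK_PROB[first]) * WORK_PROB[second] ** 2


# Joint probability of each assignment (1st, 2nd, 3rd phone) together with event A.
joint = {
    types: Fraction(1, 6) * likelihood(types[0], types[1])
    for types in permutations(WORK_PROB)
}
p_a = sum(joint.values())
p_second_reliable = sum(p for t, p in joint.items() if t[1] == "always") / p_a
print(p_second_reliable)
```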


Worked Example 1.2.4 Parliament contains a proportion p of Labour 
Party members, incapable of changing their opinions about anything, and 
1 — p of Tory Party members changing their minds at random, with prob- 
ability r, between subsequent votes on the same issue. A randomly chosen 
parliamentarian is noticed to have voted twice in succession in the same way.
Find the probability that he or she will vote in the same way next time.


Solution Set

$$A_1 = \{\text{Labour chosen}\},\quad A_2 = \{\text{Tory chosen}\},$$

$$B = \{\text{the member chosen voted twice in the same way}\}.$$

We have $P(A_1) = p$, $P(A_2) = 1-p$, $P(B|A_1) = 1$, $P(B|A_2) = 1-r$. We want
to calculate

$$P(A_1|B) = \frac{P(A_1\cap B)}{P(B)} = \frac{P(A_1)P(B|A_1)}{P(B)}$$

and $P(A_2|B) = 1 - P(A_1|B)$. Write

$$P(B) = P(A_1)P(B|A_1) + P(A_2)P(B|A_2) = p\cdot 1 + (1-p)(1-r).$$

Then

$$P(A_1|B) = \frac{p}{p + (1-r)(1-p)},\qquad P(A_2|B) = \frac{(1-r)(1-p)}{p + (1-r)(1-p)},$$

and the answer is given by

$$P(\text{the member will vote in the same way}\,|\,B) = \frac{p + (1-r)^2(1-p)}{p + (1-r)(1-p)}.$$


Worked Example 1.2.5 The Polya urn model is as follows. We start with 
an urn which contains one white ball and one black ball. At each second we 
choose a ball at random from the urn and replace it together with one more 
ball of the same colour. Calculate the probability that when $n$ balls are in
the urn, $i$ of them are white.


Solution Denote by $P_n$ the conditional probability given that there are $n$
balls in the urn. For $n = 2$ and 3

$$P_n(\text{one white ball}) = \begin{cases} 1, & n = 2,\\ \tfrac12, & n = 3,\end{cases}$$

and

$$P_n(\text{two white balls}) = \tfrac12,\quad n = 3.$$

Make the induction hypothesis

$$P_k(i \text{ white balls}) = \frac{1}{k-1}$$

for all $k = 2,\dots,n-1$ and $i = 1,\dots,k-1$. Then, after $n-2$ trials (when
the number of balls is $n$),

$$P_n(i \text{ white balls}) = P_{n-1}(i-1 \text{ white balls})\times\frac{i-1}{n-1} + P_{n-1}(i \text{ white balls})\times\frac{n-1-i}{n-1}$$

$$= \frac{1}{n-2}\times\frac{(i-1)+(n-1-i)}{n-1} = \frac{1}{n-1},\quad i = 1,\dots,n-1.$$
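The induction step above is one application of the formula of complete probability, and it can be iterated mechanically. The following sketch (Python, exact fractions; the function name is ours) propagates the distribution of the number of white balls from 2 balls up to $n$ balls.

```python
from fractions import Fraction


def polya_distribution(n):
    """Exact distribution of the number of white balls when the urn holds n balls."""
    dist = {1: Fraction(1)}            # start: 2 balls, exactly one white
    for total in range(2, n):          # 'total' = current number of balls
        new = {}
        for white, prob in dist.items():
            p_white = Fraction(white, total)   # chance of drawing a white ball
            new[white + 1] = new.get(white + 1, 0) + prob * p_white
            new[white] = new.get(white, 0) + prob * (1 - p_white)
        dist = new
    return dist


print(polya_distribution(6))
```

For each $n$ the result is the uniform distribution on $1,\dots,n-1$, in agreement with the solution.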


Worked Example 1.2.6 You have $n$ urns, the $r$th of which contains $r-1$
red balls and $n-r$ blue balls, $r = 1,\dots,n$. You pick an urn at random and
remove two balls from it without replacement. Find the probability that the 
two balls are of different colours. Find the same probability when you put 
back a removed ball. 


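Solution The totals of blue and red balls in all urns are equal. Hence, the
first ball is equally likely to be any ball. So

$$P(\text{1st blue}) = \frac12 = P(\text{1st red}).$$

Now,

$$P(\text{1st red, 2nd blue}) = \sum_{k=1}^{n} P(\text{1st red, 2nd blue}\,|\,\text{urn } k \text{ chosen})\times\frac{1}{n} = \frac{1}{n}\sum_{k=1}^{n}\frac{(k-1)(n-k)}{(n-1)(n-2)}$$

$$= \frac{1}{n(n-1)(n-2)}\left[n\sum_{k=1}^{n}(k-1) - \sum_{k=1}^{n}k(k-1)\right] = \frac{1}{n(n-1)(n-2)}\left[\frac{n\,n(n-1)}{2} - \frac{(n+1)n(n-1)}{3}\right]$$

$$= \frac{1}{n-2}\left[\frac{n}{2} - \frac{n+1}{3}\right] = \frac16.$$

We used here the following well-known identity:

$$\sum_{i=1}^{n} i(i-1) = \frac13 (n+1)n(n-1).$$

By symmetry:

$$P(\text{different colours}) = 2\times\frac16 = \frac13.$$

If you return a removed ball, the probability that the two balls are of
different colours becomes twice

$$P(\text{1st red, 2nd blue}) = \frac{1}{n(n-1)^2}\left[\frac{n^2(n-1)}{2} - \frac{(n+1)n(n-1)}{3}\right] = \frac{n-2}{6(n-1)}.$$

So the solution is $\dfrac{n-2}{3(n-1)}$.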
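Both answers are easy to confirm by conditioning on the urn chosen, exactly as in formula (1.2.6). A short sketch in Python follows; the function name and its `replace` flag are our own, and $n\ge 3$ is assumed so that two balls can be drawn without replacement.

```python
from fractions import Fraction


def p_different_colours(n, replace):
    """P(two balls differ in colour) for n urns; urn r holds r-1 red, n-r blue."""
    total = Fraction(0)
    for r in range(1, n + 1):
        red, blue = r - 1, n - r
        balls = n - 1                      # every urn holds n-1 balls
        second_pool = balls if replace else balls - 1
        # red then blue, or blue then red
        p_given_urn = Fraction(2 * red * blue, balls * second_pool)
        total += Fraction(1, n) * p_given_urn
    return total


print(p_different_colours(5, replace=False), p_different_colours(5, replace=True))
```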


Worked Example 1.2.7 The authority of Ruritania took a decision to 
pardon and release one of three prisoners X, Y and Z imprisoned in solitary 
confinement in a notorious Alcazar prison. The prisoners know about this 
decision but have no clue who the lucky one is; the waiting is agonising. A
sympathetic but corrupt prison guard approaches prisoner X with an offer 
to name for some bribe another prisoner (not X) who is condemned to stay. 
He says: ‘This will reduce your chances of staying from 2/3 down to 1/2. It 
should be a relief for you’. After some hesitation X accepts the offer and 
the guard names Y. 

Now, suppose that the information is correct and Y would not be released.
Find the conditional probability

$$P(\text{X remains in prison}\,|\,\text{Y named}) = \frac{P(\text{X, Y remain in prison})}{P(\text{Y named})}$$


and check the validity of the guard’s claim in the following three cases: 

(a) The guard is unbiased, i.e. he names Y or Z with probabilities 1/2 in 
the case both Y and Z are condemned to remain in prison. 

(b) The guard hates Y and definitely names him in the case Y is condemned 
to remain in prison. 

(c) The guard hates Z and definitely names him in the case Z is condemned 
to remain in prison. 


Solution By symmetry,

$$P(\text{X released}) = 1/3 \quad\text{and}\quad P(\text{X remains in prison}) = 2/3,$$

and the same is true for Y and Z. Next, find the conditional probability

$$P(\text{X remains in prison}\,|\,\text{Y named}) = \frac{P(\text{X, Y remain in prison})}{P(\text{Y named})}$$

where $P(\text{X, Y remain in prison}) = 1/3$. In the case (a) we have

$$P(\text{Y named}) = P(\text{X, Y remain in prison}) + \frac12 P(\text{Y, Z remain in prison}) = \frac13 + \frac12\times\frac13 = \frac12,$$

so that

$$P(\text{X remains in prison}\,|\,\text{Y named}) = \frac{1/3}{1/2} = \frac23,$$

and the guard’s claim is wrong: his information does not affect the chances
of the prisoner. In the case (b) we have

$$P(\text{Y named}) = P(\text{X, Y remain in prison}) + P(\text{Y, Z remain in prison}) = \frac23$$

and

$$P(\text{X remains in prison}\,|\,\text{Y named}) = \frac12$$

and the prison guard is correct. Finally, in the case (c)

$$P(\text{Y named}) = P(\text{X, Y remain in prison}) = \frac13.$$

In this case

$$P(\text{X remains in prison}\,|\,\text{Y named}) = 1$$

and the prison guard is wrong again.
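The three cases differ only in the probability $q$ with which the guard names Y when both Y and Z are to remain in prison ($q = 1/2$, 1 and 0 in cases (a), (b) and (c)). The sketch below (Python; the parameter `q` and the function name are our notation) computes the conditional probability for a general $q$.

```python
from fractions import Fraction


def p_x_remains_given_y_named(q):
    """q = P(guard names Y | both Y and Z remain in prison)."""
    third = Fraction(1, 3)
    # Z released: X and Y remain, and the guard has to name Y.
    p_y_named_and_x_remains = third
    # X released: Y and Z remain, and the guard names Y with probability q.
    # Y released: the guard cannot name Y.
    p_y_named = third + third * q
    return p_y_named_and_x_remains / p_y_named


for label, q in [("a", Fraction(1, 2)), ("b", Fraction(1)), ("c", Fraction(0))]:
    print(label, p_x_remains_given_y_named(q))
```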


Worked Example 1.2.8 You are on a game show and given a choice of 
three doors. Behind one is a car; behind the two others are a goat and a pig. 
You pick door 1, and the host opens door 3, with a pig. The host asks if you 
want to pick door 2 instead. Should you switch? What if instead of a goat 
and a pig there were two goats? 


Solution A popular solution of this problem always assumes that the host 
knows behind which door the car is, and takes care not to open this door 
rather than doing so by chance. (It is assumed that the host never opens 
the door picked by you.) In fact, it is instructive to consider two cases, 
depending on whether the host does or does not know the door with the 
car. If he doesn’t, your chances are unaffected, otherwise you should switch. 
Indeed, consider the events

$$Y_i = \{\text{you pick door } i\},\quad C_i = \{\text{the car is behind door } i\},$$

$$H_i = \{\text{the host opens door } i\},\quad G_i/P_i = \{\text{a goat/pig is behind door } i\},$$

with $P(Y_i) = P(C_i) = P(G_i) = P(P_i) = 1/3$, $i = 1,2,3$. Obviously, event $Y_i$
is independent of any of the events $C_j$, $G_j$ and $P_j$, $i,j = 1,2,3$.

You want to calculate

$$P(C_1\,|\,Y_1\cap H_3\cap P_3) = \frac{P(C_1\cap Y_1\cap H_3\cap P_3)}{P(Y_1\cap H_3\cap P_3)}.$$

In the numerator:

$$P(C_1\cap Y_1\cap H_3\cap P_3) = P(C_1)P(Y_1|C_1)P(P_3|C_1\cap Y_1)P(H_3|C_1\cap Y_1\cap P_3) = \frac13\times\frac13\times\frac12\times\frac12 = \frac{1}{36}.$$

If the host doesn’t know where the car is, then

$$P(Y_1\cap H_3\cap P_3) = P(Y_1)P(P_3|Y_1)P(H_3|Y_1\cap P_3) = \frac13\times\frac13\times\frac12 = \frac{1}{18}$$

and $P(C_1\,|\,Y_1\cap H_3\cap P_3) = 1/2$. But if he does then

$$P(Y_1\cap H_3\cap P_3) = P(Y_1\cap C_1\cap H_3\cap P_3) + P(Y_1\cap C_2\cap H_3\cap P_3) = \frac13\times\frac13\times\frac12\times\frac12 + \frac13\times\frac13\times\frac12\times 1 = \frac{1}{12},$$

and $P(C_1\,|\,Y_1\cap H_3\cap P_3) = 1/3$.
The answer remains the same if there were two goats instead of a goat
and a pig. Another useful exercise is to consider the case where the host
has some ‘preference’ choosing a door with the goat with probability $p_g$, and
that with the pig with probability $p_p = 1 - p_g$.
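The two cases can be compared by exact enumeration of the six arrangements of car, goat and pig. In the sketch below (Python) we condition on your picking door 1, which is harmless because $Y_1$ is independent of the arrangement; we also assume that a host with a choice of doors picks one of them with probability 1/2. The function name is ours.

```python
from fractions import Fraction
from itertools import permutations


def p_car_behind_door_1(host_knows):
    """P(C_1 | you picked door 1, host opened door 3, a pig is behind door 3)."""
    numerator = denominator = Fraction(0)
    for layout in permutations(["car", "goat", "pig"]):   # contents of doors 1, 2, 3
        p_layout = Fraction(1, 6)
        # The host never opens door 1; an informed host also avoids the car.
        allowed = [d for d in (2, 3) if not (host_knows and layout[d - 1] == "car")]
        p_opens_3 = Fraction(allowed.count(3), len(allowed))
        if layout[2] == "pig":
            denominator += p_layout * p_opens_3
            if layout[0] == "car":
                numerator += p_layout * p_opens_3
    return numerator / denominator


print(p_car_behind_door_1(host_knows=False), p_car_behind_door_1(host_knows=True))
```

With an informed host the car is behind door 1 with probability 1/3, so switching wins with probability 2/3; with an uninformed host the two remaining doors are equally likely.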


We continue our study by introducing the definition of independent events. 
The concept of independence was an important invention in probability 
theory. It shaped the theory at an early stage and is considered one of the 
main features specifying the place of probability theory within more general 
measure theory. 

We say that events A and B are independent if 


$$P(A\cap B) = P(A)P(B). \qquad (1.2.8)$$


A convenient criterion of independence is: events $A$ and $B$, where say $P(B) > 0$,
are independent if and only if $P(A|B) = P(A)$, i.e. knowledge that $B$
occurred does not change the probability of $A$.

Trivial examples are the empty event $\emptyset$ and the whole set $\Omega$: they are
independent of any event. The next example we consider is when each of the
four outcomes 00, 01, 10, and 11 has probability 1/4. Here the events

$$A = \{\text{1st digit is 1}\} \quad\text{and}\quad B = \{\text{2nd digit is 0}\}$$

are independent:

$$P(A) = p_{10} + p_{11} = \frac12 = p_{10} + p_{00} = P(B),\qquad P(A\cap B) = p_{10} = \frac14 = \frac12\times\frac12.$$
Also, the events 
{1st digit is 0} and {both digits are the same} 
are independent, while the events 
{1st digit is 0} and {the sum of digits is > 0} 


are dependent. 
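Definition (1.2.8) can be checked mechanically on this four-point space. The following Python sketch (helper names are ours) tests the three pairs of events just mentioned.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes 00, 01, 10, 11.
OMEGA = {a + b: Fraction(1, 4) for a, b in product("01", repeat=2)}


def prob(event):
    """Probability of the set of outcomes w for which event(w) is true."""
    return sum(p for w, p in OMEGA.items() if event(w))


def independent(a, b):
    """Check P(A and B) = P(A) P(B), as in (1.2.8)."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)


first_is_1 = lambda w: w[0] == "1"
first_is_0 = lambda w: w[0] == "0"
second_is_0 = lambda w: w[1] == "0"
same_digits = lambda w: w[0] == w[1]
positive_sum = lambda w: int(w[0]) + int(w[1]) > 0

print(independent(first_is_1, second_is_0),
      independent(first_is_0, same_digits),
      independent(first_is_0, positive_sum))
```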

These examples can be easily reformulated in terms of two unbiased coin-tossings.
An important fact is that if $A$, $B$ are independent then $A^{c}$ and $B$
are independent:

$$P(A^{c}\cap B) = P(B\setminus(A\cap B)) = P(B) - P(A\cap B)$$
$$= P(B) - P(A)P(B) \quad\text{(by independence)}$$
$$= [1 - P(A)]P(B) = P(A^{c})P(B).$$

Next, if (i) $A_1$ and $B$ are independent, (ii) $A_2$ and $B$ are independent, and
(iii) $A_1$ and $A_2$ are disjoint, then $A_1\cup A_2$ and $B$ are independent. If (i) and
(ii) hold and $A_1\subset A_2$ then $B$ and $A_2\setminus A_1$ are also independent.

Intuitively, independence is often associated with an ‘absence of any con- 
nection’ between events. There is a famous joke about Andrey Nikolayevich 
Kolmogorov (1903-1987), the renowned Russian mathematician considered 
the father of modern probability theory. His monograph, [84], which
originally appeared in German in 1933, was revolutionary in understanding the
basics of probability theory and its role in mathematics and its applications.
When in the 1930s this monograph was translated into Russian, the Soviet 
government enquired about the nature of the concept of independent events. 
A senior minister asked if this concept was consistent with materialistic de- 
terminism, the core of Marxist-Leninist philosophy, and about examples of 
such events. Kolmogorov had to answer on the spot, and he had to be cau- 
tious as subsequent events showed, such as the infamous condemnation by 
the authorities of genetics as a ‘reactionary bourgeois pseudo-science’. The 
legend is that he did not hesitate for a second, and said: ‘Look, imagine a 
remote village where there has been a long drought. One day, local peasants 
in desperation go to the church, and the priest says a prayer for rain. And 
the next day the rain arrives! These are independent events’. 
In reality, the situation is more complex. A helpful view is that independence
is a geometric property. In the above example, the four probabilities

$$p_{00},\ p_{01},\ p_{10}\ \text{and}\ p_{11}$$

can be assigned to the vertices

$$(0;0),\ (0;1),\ (1;0)\ \text{and}\ (1;1)$$

of a unit square. See Figure 1.1. Each of these four points has a projection
onto the horizontal and the vertical line. The projections are points 0 and 1
on each of these lines, and a vertex is uniquely determined by its projections.
If the projection points have probability mass 1/2 on each line then each
vertex has

$$p_{ij} = \frac12\times\frac12 = \frac14,\quad i,j = 0,1.$$

In this situation one says that the four-point probability distribution

$$\left\{\tfrac14,\ \tfrac14,\ \tfrac14,\ \tfrac14\right\}$$

is a product of two two-point distributions

$$\left\{\tfrac12,\ \tfrac12\right\}.$$

It is easy to imagine a similar picture where there are $m$ points along the
horizontal and $n$ along the vertical line: we would then have $mn$ pairs $(i,j)$
(lattice sites) where $i = 0,\dots,m-1$, $j = 0,\dots,n-1$ and each pair will
receive probability mass $1/(mn)$. Moreover, the equidistribution can be replaced
by a more general law:

$$p_{ij} = p^{(1)}_i p^{(2)}_j,$$

where

$$p^{(1)}_i,\ i = 0,\dots,m-1,\quad\text{and}\quad p^{(2)}_j,\ j = 0,\dots,n-1,$$

are probability distributions for the two projections. Then any event
that is expressed in terms of the horizontal projection (for example,
$\{\text{digit } i \text{ is divisible by 3}\}$) is independent of any event expressed in terms
of the vertical projection (for example, $\{\text{digit } j \text{ is} < n/2\}$). This is a basic
(and in a sense the only) example of independence.

Figure 1.1 (the unit square with probability mass 1/4 at each vertex and mass 1/2 at each projection point on the two axes, illustrating $p_{ij} = p^{(1)}_i p^{(2)}_j$)

More generally, we say that events $A_1,\dots,A_n$ are mutually independent
(or simply independent) if for all subcollections $A_{i_1},\dots,A_{i_l}$,

$$P(A_{i_1}\cap\dots\cap A_{i_l}) = P(A_{i_1})\cdots P(A_{i_l}). \qquad (1.2.9)$$

This includes the whole collection $A_1,\dots,A_n$. It is important to distinguish
this situation from the one where (1.2.9) holds only for some subcollections;
say pairs $A_i, A_j$, $1\le i<j\le n$, or only for the whole collection $A_1,\dots,A_n$.
See Worked Examples 1.2.17 and 1.2.18. 

This gives rise to an important model: a sequence of independent trials, 
each with two or more outcomes. Such a model is behind many problems, 
and it is essential to familiarise yourself with it. 

So far, we assumed that the total number of outcomes $\omega$ is finite, but the
material in this section can be extended to the case where $\Omega$ is a countable
set, consisting of points $x_1, x_2, \dots$, say, with assigned probabilities
$p_i = P(\{x_i\})$, $i = 1, 2, \dots$. Of course, the labelling of the outcomes can be
different, for instance, by $i\in\mathbb{Z}$, the set of integers. The requirements are as
before: each $p_i > 0$ and $\sum_i p_i = 1$.

We can also work with infinite sequences of events. For example, equa- 
tions (1.2.6) and (1.2.7) do not change form: 


$$P(A) = \sum_{1\le j<\infty} P(A|B_j)P(B_j),\qquad P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_{1\le j<\infty} P(A|B_j)P(B_j)}. \qquad (1.2.10)$$


Worked Example 1.2.9 A coin shows heads with probability $p$ on each
toss. Let $\pi_n$ be the probability that the number of heads after $n$ tosses is
even. By showing that $\pi_{n+1} = (1-p)\pi_n + p(1-\pi_n)$, $n\ge 1$, or otherwise,
find $\pi_n$. (The number 0 is considered even.)


Solution As always in the coin-tossing models, we assume that outcomes
of different throws are independent. Set $A_n = \{n\text{th toss is a head}\}$, with
$P(A_n) = p$ and $B_n = \{\text{even number of heads after } n \text{ tosses}\}$, with $\pi_n = P(B_n)$.
Then, by conditioning on $A_{n+1}$ and $A_{n+1}^{c}$:

$$P(B_{n+1}) = P(B_{n+1}\cap A_{n+1}) + P(B_{n+1}\cap A_{n+1}^{c})$$
$$= P(B_{n+1}|A_{n+1})P(A_{n+1}) + P(B_{n+1}|A_{n+1}^{c})P(A_{n+1}^{c}).$$

Next, $B_{n+1}\cap A_{n+1} = B_n^{c}\cap A_{n+1}$ and $B_{n+1}\cap A_{n+1}^{c} = B_n\cap A_{n+1}^{c}$. In view of
independence,

$$P(B_n^{c}\cap A_{n+1}) = P(B_n^{c})P(A_{n+1}),\quad\text{and}\quad P(B_n\cap A_{n+1}^{c}) = P(B_n)P(A_{n+1}^{c}),$$

which implies

$$P(B_{n+1}) = P(B_n^{c})P(A_{n+1}) + P(B_n)P(A_{n+1}^{c}) = (1 - P(B_n))p + P(B_n)(1-p),$$

with $P(B_0) = 1$. That is,

$$\pi_{n+1} = (1-p)\pi_n + p(1-\pi_n) = (1-2p)\pi_n + p.$$

Substituting $\pi_n = a(1-2p)^n + b$ gives that $b = 1/2$, and the condition $\pi_0 = 1$
that $a = 1/2$. Then $\pi_n = [(1-2p)^n + 1]/2$.

A shorter way to derive the recursion is by conditioning on $B_n$ and $B_n^{c}$:

$$\pi_{n+1} = P(B_{n+1}) = P(B_{n+1}\cap B_n) + P(B_{n+1}\cap B_n^{c})$$
$$= P(A_{n+1}^{c}|B_n)P(B_n) + P(A_{n+1}|B_n^{c})P(B_n^{c})$$
$$= (1-p)\pi_n + p(1-\pi_n).$$

Writing recursive equations like the one in the statement of the current
problem is a convenient instrument used in a great many situations.
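The recursion and the closed-form answer can be compared directly. A minimal sketch in Python with exact fractions follows; the function names and the test value $p = 1/3$ are our own choices.

```python
from fractions import Fraction


def pi_recursive(n, p):
    """Iterate pi_{k+1} = (1-p) pi_k + p (1 - pi_k), starting from pi_0 = 1."""
    value = Fraction(1)
    for _ in range(n):
        value = (1 - p) * value + p * (1 - value)
    return value


def pi_closed(n, p):
    """Closed form [(1-2p)^n + 1] / 2."""
    return ((1 - 2 * p) ** n + 1) / 2


p = Fraction(1, 3)
print([pi_recursive(n, p) for n in range(4)])
```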


Solution (Look at this solution after reading Section 1.5.) Let $X_i = 0$ or 1
be the outcome of the $i$th trial, and $Y_n = X_1 + \dots + X_n$ the total number
of successes in $n$ trials. Then the probability generating function of $Y_n$ is

$$\phi(s) = [ps + (1-p)]^n.$$

Then the probability that $n$ trials result in an even number of successes is

$$\frac12\left[\phi(1) + \phi(-1)\right] = \frac12\left[1 + (1-2p)^n\right].$$


Worked Example 1.2.10 My Aunt Agatha has given me a clockwork 
orange for my birthday. I place it in the middle of my dining table which 
happens to be exactly 2 metres long. One minute after I place it on the table 
it makes a loud whirring sound, emits a puff of green smoke and moves 10 
cm towards the left-hand end of the table with probability 3/5, or 10 cm 
towards the right with probability 2/5. It continues in this manner (the 
direction of each move being independent of what has gone before) at one 
minute intervals until it reaches the edge of the table where it promptly falls 
off. If it falls off the left-hand end it will break my Ming vase (also a present 
from Aunt Agatha). If it falls off the right-hand end it will land safely in a
bucket of water. What is the probability that the Ming vase will survive? 


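Solution Set $p_\ell$ to be

$$P(\text{falls at RH end}\,|\,\text{was at distance } \ell\times 10\text{ cm from the LH end}),$$

then $1 - p_\ell$ equals

$$P(\text{falls at LH end}\,|\,\text{was at distance } \ell\times 10\text{ cm from the LH end}).$$

We have $p_0 = 0$, $p_{20} = 1$ and $p_\ell = \frac35 p_{\ell-1} + \frac25 p_{\ell+1}$, or

$$p_{\ell+1} = \frac52 p_\ell - \frac32 p_{\ell-1}.$$

In other words, vector $(p_\ell, p_{\ell+1}) = (p_{\ell-1}, p_\ell)A$ with

$$A = \begin{pmatrix} 0 & -\frac32 \\ 1 & \frac52 \end{pmatrix}.$$

This yields

$$(p_\ell, p_{\ell+1}) = (p_0, p_1)A^{\ell} = (0, p_1)A^{\ell},$$

i.e. $p_\ell$ should be a linear combination of the $\ell$th powers of the eigenvalues of
$A$. The eigenvalues are $\lambda_1 = \frac32$ and $\lambda_2 = 1$ and so:

$$p_\ell = b_1\left(\frac32\right)^{\ell} + b_2.$$

We have the equations

$$b_1 + b_2 = 0,\qquad 1 = b_1\left(\frac32\right)^{20} + b_2,$$

whence

$$b_1 = -b_2 = \frac{1}{(3/2)^{20} - 1}$$

and

$$p_{10} = \frac{(3/2)^{10} - 1}{(3/2)^{20} - 1} = \frac{1}{(3/2)^{10} + 1}.$$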
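The boundary-value problem for $p_\ell$ can also be solved numerically without eigenvalues. The sketch below (Python, exact fractions) uses the fact that the recursion is linear: it runs the recursion forward from $u_0 = 0$, $u_1 = 1$ and then rescales so that the value at the right-hand edge equals 1. The function and parameter names are ours.

```python
from fractions import Fraction


def p_falls_right(start=10, length=20, p_left=Fraction(3, 5)):
    """P(orange falls off the right-hand end | it starts 'start' steps from the left)."""
    p_right = 1 - p_left
    # p_l = p_left * p_{l-1} + p_right * p_{l+1}, solved for p_{l+1}.
    u = [Fraction(0), Fraction(1)]
    for l in range(1, length):
        u.append((u[l] - p_left * u[l - 1]) / p_right)
    # Rescale so that the boundary condition p_length = 1 holds.
    return u[start] / u[length]


vase_survives = p_falls_right()
print(vase_survives, float(vase_survives))
```

The Ming vase therefore survives with probability $1024/60073$, a little under 2 per cent.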


Worked Example 1.2.11 Dubrovsky sits down to a night of gambling 
with his fellow officers. Each time he stakes $u$ Roubles there is a probability
$r$ that he will win and receive back $2u$ Roubles (including his stake). At the
beginning of the night he has 8000 Roubles. If ever he has 256000 Roubles
he will marry the beautiful Natasha and retire to his estate in the country. 
Otherwise, he will commit suicide. He decides to follow one of two courses 
of action: 


(i) to stake 1000 Roubles each time until the issue is decided; 
(ii) to stake everything each time until the issue is decided. 


Advise him (a) if r = 1/4 and (b) if r = 3/4. What are the chances of a 
happy ending in each case if he follows your advice? 


Solution Let $p_\ell$ be the probability that Dubrovsky wins 256000 with the
starting capital $\ell$ thousands while following strategy (i). Reasoning as in
Worked Example 1.2.10 yields that

$$p_\ell = b_1\lambda_1^{\ell} + b_2\lambda_2^{\ell},$$

where $\lambda_1 = (1-r)/r$ and $\lambda_2 = 1$ are the eigenvalues of the matrix

$$A = \begin{pmatrix} 0 & -(1-r)/r \\ 1 & 1/r \end{pmatrix}.$$

The boundary conditions $p_0 = 0$ and $p_{256} = 1$ yield

$$b_1 = -b_2 = \frac{1}{\lambda_1^{256} - 1}.$$

For $r = 1/4$, $(1-r)/r = 3$. Then he should choose (ii) as

$$p_8 = \frac{3^8 - 1}{3^{256} - 1},$$

which is tiny compared with $(1/4)^5$, the chance to win 256000 in five
successful rounds by gambling on everything he obtains.

For $r = 3/4$, $(1-r)/r = 1/3$. Then

$$p_8 = \frac{1 - (1/3)^8}{1 - (1/3)^{256}},$$

which is much larger than $(3/4)^5$. Therefore, he should choose (i).
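To see how lopsided the comparison is, the two success probabilities can be evaluated exactly. The sketch below is in Python; the function names are ours, and the small-stakes formula assumes $r\ne 1/2$.

```python
from fractions import Fraction


def p_small_stakes(r, start=8, target=256):
    """Strategy (i): stake 1000 each time; p = (lam^start - 1)/(lam^target - 1)."""
    lam = (1 - r) / r
    return (lam ** start - 1) / (lam ** target - 1)


def p_stake_everything(r, rounds=5):
    """Strategy (ii): 8 -> 16 -> 32 -> 64 -> 128 -> 256 needs five wins in a row."""
    return r ** rounds


for r in (Fraction(1, 4), Fraction(3, 4)):
    print(r, float(p_small_stakes(r)), float(p_stake_everything(r)))
```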


Remark 1.2.12 In both Worked Examples 1.2.10 and 1.2.11 one of the 
eigenvalues of the recursion matrix A equals 1. This is not accidental and 
is due to the fact that in equation (1.2.6) (which gives rise to the recursive
equations under consideration) the sum $\sum_i P(B_i) = 1$.
Worked Example 1.2.13 I play the dice game ‘craps’ against ‘Lucky’
Pete Jay as follows. On each throw I throw two dice. If my first throw is 7 or
11, then I win and if it is 2, 3 or 12, then I lose. If my first throw is none of
these, I throw repeatedly until I score the same as my first throw, in which
case I win, or I throw a 7, in which case I lose. What is the probability that
I will win?


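Solution Write

$$P(\text{I win}) = P(\text{I win at the 1st throw}) + P(\text{I win, but not at the 1st throw}).$$

The probability $P(\text{I win at the 1st throw})$ is straightforward and equals

$$\sum_{i,j=1}^{6}\frac{1}{36}\,I(i+j = 7) + \sum_{i,j=1}^{6}\frac{1}{36}\,I(i+j = 11) = \frac{6}{36} + \frac{2}{36} = \frac29.$$

Here

$$I(i+j = 7) = \begin{cases} 1, & i+j = 7,\\ 0, & \text{otherwise.}\end{cases}$$

For the second probability we have

$$P(\text{I win, but not at the 1st throw}) = \sum_{i=4,5,6,8,9,10} p_i Q_i.$$

Here

$$p_i = P(\text{the 1st score is } i),$$

$$Q_i = P(\text{get } i \text{ before 7 in the course of repeated throws}\,|\,\text{the 1st score is } i)$$
$$= P(\text{get } i \text{ before 7 in the course of repeated throws}).$$

Then for $Q_i$, by conditioning on the result of the first throw:

$$Q_i = p_i + (1 - p_i - p_7)Q_i,\quad\text{i.e.}\quad Q_i = \frac{p_i}{p_i + p_7}.$$

Equivalently,

$$Q_i = p_i + (1 - p_i - p_7)p_i + (1 - p_i - p_7)^2 p_i + \cdots,$$

with the same result.

Now

$$Q_4 = \frac{3/36}{3/36 + 6/36} = \frac{3}{3+6} = \frac13,\qquad Q_5 = \frac{4/36}{4/36 + 6/36} = \frac{4}{4+6} = \frac25,$$

and likewise,

$$Q_6 = \frac{5/36}{5/36 + 6/36} = \frac{5}{5+6} = \frac{5}{11},\quad Q_8 = \frac{5}{11},\quad Q_9 = \frac25,\quad Q_{10} = \frac13,$$

giving for $P(\text{I win, but not at the 1st throw})$ the value

$$\frac{3}{36}\cdot\frac13 + \frac{4}{36}\cdot\frac25 + \frac{5}{36}\cdot\frac{5}{11} + \frac{5}{36}\cdot\frac{5}{11} + \frac{4}{36}\cdot\frac25 + \frac{3}{36}\cdot\frac13 = \frac{134}{495}.$$

Then

$$P(\text{I win}) = \frac29 + \frac{134}{495} = \frac{244}{495}.$$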
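The arithmetic is easily reproduced by machine. The sketch below (Python, exact fractions; variable names are ours) builds the distribution of the sum of two dice and applies $Q_i = p_i/(p_i + p_7)$.

```python
from fractions import Fraction
from itertools import product

# Distribution of the sum of two fair dice.
p = {s: Fraction(0) for s in range(2, 13)}
for a, b in product(range(1, 7), repeat=2):
    p[a + b] += Fraction(1, 36)

win_first = p[7] + p[11]
# For each point i, Q_i = p_i / (p_i + p_7) is the chance of i before 7.
win_later = sum(p[i] * p[i] / (p[i] + p[7]) for i in (4, 5, 6, 8, 9, 10))
p_win = win_first + win_later
print(win_first, win_later, p_win)
```

The game is thus slightly unfavourable to the thrower: $244/495\approx 0.493$.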


Worked Example 1.2.14 Two darts players throw alternately at a board 
and the first to score a bull wins. On each of their throws player A has 
probability $p_A$ and player B $p_B$ of success; the results of different throws are
independent. If A starts, calculate the probability that he/she wins. 


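Solution Consider the diagram below.

[Diagram: a chain of successive throws. At each of A's throws, A scores a bull with probability $p_A$ (A wins) or misses with probability $1 - p_A$; at each of B's throws, B scores with probability $p_B$ (B wins) or misses with probability $1 - p_B$.]

If $Q = P(\text{A wins})$, then

$$Q = p_A + (1-p_A)(1-p_B)p_A + (1-p_A)^2(1-p_B)^2 p_A + \cdots = \frac{p_A}{1 - (1-p_A)(1-p_B)} = \frac{p_A}{p_A + p_B - p_A p_B}.$$

Equivalently, by conditioning on the first and the second throw, one gets the
equation

$$P(\text{A wins}) = p_A + (1-p_A)(1-p_B)P(\text{A wins}),$$

which is immediately solved to give the required result.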


Remark 1.2.15 In Worked Examples 1.2.13 and 1.2.14 we used an equa- 
tion for the probabilities Q and Q; that was equivalent to their representation 
as series. This is another useful idea; for example, it allowed us to avoid the 
use of infinite outcome spaces. However, we will not be able to avoid it much 
longer. 
Worked Example 1.2.16 A fair coin is tossed until either the sequence
HHH occurs, in which case I win, or the sequence THH occurs, when you
win. What is the probability that you will win?


Solution In principle, the results of the game could be: I win, you win,
and the game lasts forever. Observe that I win only if HHH occurs at the
beginning: the probability is $(1/2)^3 = 1/8$. Indeed, if HHH occurs but not
at the beginning then THH must have occurred before it, and you will have
already won. But HHH will appear sooner or later, with probability 1. In
fact, for all $N$, the event

$$A = \{\text{HHH never occurs}\}$$

is contained in the event

$$A_N = \{\text{HHH does not occur among the first } N \text{ subsequent triples}\}$$

(we partition the first $3N$ trials into $N$ subsequent triples). So $P(A)\le P(A_N)$.
But the probability $P(A_N) = (1 - 1/8)^N\to 0$ as $N\to\infty$. Hence, $P(A) = 0$.
Therefore, the game cannot continue forever, and the probability that you
will win is $1 - 1/8 = 7/8$.


Worked Example 1.2.17 (i) Give examples of the following phenomena: 


(a) three events A, B, C that are pair-wise independent but not indepen- 
dent; 

(b) three events that are not independent, but such that the probability of 
the intersection of all three is equal to the product of the probabilities. 


(ii) Three coins each show heads with probability 3/5 and tails otherwise. 
The first counts 10 points for a head and 2 for a tail, the second counts 4 
points for a head and tail, and the third counts 3 points for a head and 20 
for a tail. 

You and your opponent each choose a coin; you cannot choose the same 
coin. Each of you tosses your coin once and the person with the larger score 
wins $10^{10}$ points. Would you prefer to be the first or the second to choose
a coin? 


Solution (i) (a) Toss two unbiased coins, with

$$A = \{\text{1st toss shows H}\},\quad B = \{\text{2nd toss shows H}\}$$

and

$$C = \{\text{both tosses show the same side}\}.$$

Then

$$P(A\cap B) = p_{HH} = \frac14 = P(A)P(B),\quad P(A\cap C) = p_{HH} = \frac14 = P(A)P(C),$$
$$P(B\cap C) = p_{HH} = \frac14 = P(B)P(C),$$

but

$$P(A\cap B\cap C) = p_{HH} = \frac14 \ne P(A)P(B)P(C).$$

Or throw two dice, with

$$A = \{\text{die one shows an odd score}\},\quad B = \{\text{die two shows an odd score}\},\quad C = \{\text{overall score odd}\}$$

and $P(A) = P(B) = P(C) = 1/2$. Then

$$P(A\cap B) = P(A\cap C) = P(B\cap C) = \frac14\quad\text{but}\quad P(A\cap B\cap C) = 0.$$

(b) Toss three coins, with

$$A = \{\text{1st toss shows H}\},\quad B = \{\text{3rd toss shows H}\},$$
$$C = \{HHH, HHT, HTT, TTT\} = \{\text{no subsequent } TH\}.$$

Then $P(A) = P(B) = P(C) = 1/2$,

$$A\cap B\cap C = \{HHH\},\quad\text{and}\quad P(A\cap B\cap C) = \frac18 = \left(\frac12\right)^3.$$

But

$$A\cap C = \{HHH, HHT, HTT\},\quad\text{and}\quad P(A\cap C) = \frac38,$$

while

$$B\cap C = \{HHH\},\quad\text{and}\quad P(B\cap C) = \frac18.$$

Or (as the majority of students’ attempts did so far), take $A = \emptyset$, and any
dependent pair $B$, $C$ (say, $B = C$ with $0 < P(B) < 1$). Then $A\cap B\cap C = \emptyset$ and

$$P(A\cap B\cap C) = 0 = P(A)P(B)P(C),\quad\text{but}\quad P(B\cap C)\ne P(B)P(C).$$

(ii) Suppose I choose coin 1 and you coin 2, then $P(\text{you win}) = 2/5$. But
if you choose 3 then

$$P(\text{you win}) = \frac25 + \frac35\times\frac25 = \frac{16}{25} > \frac12.$$

Similarly, if I choose 2 and you choose 1, $P(\text{you win}) = 3/5 > 1/2$. Finally, if
I choose 3 and you choose 2 then $P(\text{you win}) = 3/5 > 1/2$. Thus, it’s always
better to be the second.
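Part (ii) is an instance of non-transitive comparison, and the full table of pairwise winning probabilities is quickly computed. The following Python sketch encodes the three coins as score distributions; the names `COINS`, `p_beats` and `best_reply` are ours.

```python
from fractions import Fraction

# Score distribution of each coin: heads with probability 3/5, tails 2/5.
COINS = {
    1: {10: Fraction(3, 5), 2: Fraction(2, 5)},
    2: {4: Fraction(1)},
    3: {3: Fraction(3, 5), 20: Fraction(2, 5)},
}


def p_beats(a, b):
    """P(coin a scores strictly more than coin b) in independent tosses."""
    return sum(pa * pb
               for sa, pa in COINS[a].items()
               for sb, pb in COINS[b].items()
               if sa > sb)


# Best choice for the second player against each first choice.
best_reply = {
    first: max((c for c in COINS if c != first), key=lambda c: p_beats(c, first))
    for first in COINS
}
print(best_reply, {f: p_beats(c, f) for f, c in best_reply.items()})
```

Whatever the first player chooses, the second player has a reply that wins with probability greater than 1/2.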


Worked Example 1.2.18 (a) Let $A_1,\dots,A_n$ be independent events, and
$P(A_i) < 1$. Prove that there exists an event $B$, $P(B) > 0$, such that $B\cap A_i = \emptyset$
for all $1\le i\le n$.

(b) Give an example of three events $A_1, A_2, A_3$ that are dependent but where
any two events $A_i$ and $A_j$, $i\ne j$, are independent.


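Solution (a) Let $A^{c}$ be the complement of event $A$; then

$$P(A_1\cup\dots\cup A_n) = 1 - P\Big(\bigcap_{i=1}^{n} A_i^{c}\Big) = 1 - \prod_{i=1}^{n} P(A_i^{c}) < 1$$

as $P(A_i^{c}) > 0$ for all $i$. Select $B = \big(\bigcup_i A_i\big)^{c}$, then $P(B) > 0$ and $B\cap A_i = \emptyset$
for all $i$.

(b) Let $A_1 = \{1,4\}$, $A_2 = \{2,4\}$ and $A_3 = \{3,4\}$ where $P(k) = 1/4$, $k = 1,2,3,4$.
Then $P(A_i) = 1/2$, $i = 1,2,3$ and

$$P(A_i\cap A_j) = P(\{4\}) = \frac14 = \left(\frac12\right)^2 = P(A_i)P(A_j),\quad 1\le i<j\le 3;$$

on the other hand

$$P(A_1\cap A_2\cap A_3) = P(\{4\}) = \frac14 \ne \left(\frac12\right)^3 = P(A_1)P(A_2)P(A_3).$$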


Worked Example 1.2.19 You toss two symmetric dice. Let event $A_s$
mean that the sum of readings is $s$, and $B_i$ mean that the first die shows $i$.
Find all the values of $s$ and $i$ such that the events $A_s$ and $B_i$ are independent.


Solution The admissible values for $s$ are $2,\dots,12$ and for $i$ are $1,\dots,6$. We
have

$$P(A_s) = \begin{cases} \frac{1}{36}(s-1), & \text{if } 2\le s\le 7,\\[2pt] \frac{1}{36}(13-s), & \text{if } 8\le s\le 12,\end{cases}$$

and

$$P(B_i) = \frac16,\quad 1\le i\le 6.$$

Clearly, $P(A_s\cap B_i) = \frac{1}{36}$ if $1\le i\le 6$, $i < s\le i+6$, and $P(A_s\cap B_i) = 0$
otherwise. This implies that

$$P(A_s\cap B_i) = P(A_s)P(B_i)$$

if and only if (a) $s = 7$ and $1\le i\le 6$, or (b) at least one of the values $s$ or
$i$ is not admissible.

Note that the case $s = 7$ is special as $i$ could take any of the admissible values
$1,\dots,6$. In all other cases the value $s$ puts some restrictions on admissible
values of $i$ (say, if $s = 9$ then $i$ cannot take values 1 or 2). This destroys
independence.
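A brute-force check over the 36 outcomes confirms that $s = 7$ is the only admissible value giving independence. This is a short Python sketch; the helper names are ours.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = list(product(range(1, 7), repeat=2))   # (first die, second die)


def prob(event):
    """Probability of an event under the uniform law on the 36 outcomes."""
    return Fraction(sum(1 for w in OUTCOMES if event(w)), len(OUTCOMES))


# All admissible pairs (s, i) for which A_s and B_i satisfy (1.2.8).
independent_pairs = [
    (s, i)
    for s in range(2, 13)
    for i in range(1, 7)
    if prob(lambda w: sum(w) == s and w[0] == i)
    == prob(lambda w: sum(w) == s) * prob(lambda w: w[0] == i)
]
print(independent_pairs)
```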


Worked Example 1.2.20 Suppose $n$ balls are placed at random into $n$
cells. Find the probability $p_n$ that exactly two cells remain empty.


Solution ‘At random’ means here that each ball is put in a cell with proba- 
bility 1/n, independently of other balls. First, consider the cases n = 3 and 
n =A. If n = 3 we have one cell with three balls and two empty cells. Hence, 


If n = 4 we either have two cells with two balls each (probability p/,) or one 
cell with one ball and one cell with three balls (probability p!/). Hence, 


hd tihh n (n — 2) n(n—1)] _ 21 
p= ret = (3) x 2 x [a+ Fi ar 


64 


Here 4 stands for the number of ways of selecting three balls that will go to 
one cell, and n(n —1)/4 stands for the number of ways of selecting two pairs 
of balls that will go to two prescribed cells. 

For $n \ge 5$, to have exactly two empty cells means that either there are
exactly two cells with two balls in them and $n-4$ with a single ball, or there
is one cell with three balls and $n-3$ with a single ball. Denote the probabilities
in question by $p_n'$ and $p_n''$, respectively. Then $p_n = p_n' + p_n''$. Further,
$$p_n' = \binom{n}{2} \times \frac12\binom{n}{2}\binom{n-2}{2} \times \frac{n-2}{n}\,\frac{n-3}{n}\cdots\frac{1}{n} \times \left(\frac1n\right)^2 = \frac{n!}{4n^n}\binom{n}{2}\binom{n-2}{2}.$$
Here the first factor, $\binom{n}{2}$, is responsible for the number of ways of choosing
two empty cells among $n$. The second factor,
$$\frac12\binom{n}{2}\binom{n-2}{2},$$
accounts for choosing which balls ‘decide’ to fall in cells that will contain
two balls and also which cells will contain two balls. Finally, the third factor,
$$\frac{n-2}{n}\,\frac{n-3}{n}\cdots\frac{1}{n},$$
gives the probability that $n-2$ balls fall in $n-2$ cells, one in each, and
the last, $(1/n)^2$, that two pairs of balls go into the cells marked for two-ball
occupancy. Next,
$$p_n'' = \binom{n}{2} \times (n-2) \times \binom{n}{3} \times \frac{(n-3)!}{n^n}.$$
Here the first factor, $\binom{n}{2}$, is responsible for the number of ways of choosing
two empty cells among $n$, the second, $(n-2)$, is responsible for the number
of ways of choosing the cell with three balls, the third, $\binom{n}{3}$, is responsible
for the number of ways of choosing three balls to go into this cell, and the
last factor, $\frac{(n-3)!}{n^n}$, describes the distribution of all balls into the respective
cells.
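For small $n$ the closed-form expressions can be compared with a direct enumeration of all $n^n$ placements. The sketch below is only a sanity check; the function names `cells_formula` and `cells_brute` are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial

def cells_formula(n):
    """p_n = p'_n + p''_n from the closed-form expressions above."""
    p1 = Fraction(factorial(n), 4 * n**n) * comb(n, 2) * comb(n - 2, 2)
    p2 = Fraction(comb(n, 2) * (n - 2) * comb(n, 3) * factorial(n - 3), n**n)
    return p1 + p2

def cells_brute(n):
    """Enumerate all n**n placements; count those leaving exactly two cells empty."""
    hits = sum(1 for placement in product(range(n), repeat=n)
               if n - len(set(placement)) == 2)
    return Fraction(hits, n**n)

for n in range(3, 7):
    print(n, cells_formula(n), cells_brute(n))
```

Under these assumptions the general formulas also reproduce the special cases $n = 3$ and $n = 4$ treated separately above.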


Worked Example 1.2.21 You play a match against an opponent in which 
at each point either you or he/she serves. If you serve you win the point with 
probability $p_1$, but if your opponent serves you win the point with probability
$p_2$. There are two possible conventions for serving:

(i) alternate serves; 

(ii) the player serving continues until he/she loses a point. 

You serve first and the first player to reach n points wins the match. Show 
that your probability of winning the match does not depend on the serving 
convention adopted. 


Solution Both systems give you equal probabilities of winning. In fact, 
suppose we extend the match beyond the result achieved, until you have 
served n and your opponent n—1 times. (Under rule (i) you just continue the 
alternating services and under (ii) the loser is given the remaining number 
of serves.) Then, under either rule, if you win the actual game, you also 
win the extended one, and vice versa (as your opponent won’t have enough 
points to catch up with you). So it suffices to check the extended matches. 

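An outcome of an extended match is $\omega = (\omega_1,\ldots,\omega_{2n-1})$, a sequence of
$2n-1$ subsequent values, say zero (you lose a point) and one (you gain
one). You may think that $\omega_1,\ldots,\omega_n$ represent the results of your serves and
$\omega_{n+1},\ldots,\omega_{2n-1}$ those of your opponent. Define events
$$A_i = \{\text{you win your } i\text{th service}\}, \quad\text{and}\quad B_j = \{\text{you win his } j\text{th service}\}.$$
Their respective indicator functions are
$$I(\omega \in A_i) = \begin{cases} 1, & \omega \in A_i,\\ 0, & \omega \notin A_i,\end{cases} \quad 1 \le i \le n,$$
and
$$I(\omega \in B_j) = \begin{cases} 1, & \omega \in B_j,\\ 0, & \omega \notin B_j,\end{cases} \quad 1 \le j \le n-1.$$
Under both rules, the event that you win the extended match is
$$\Big\{\omega = (\omega_1,\ldots,\omega_{2n-1}) : \sum_{1 \le i \le 2n-1} \omega_i \ge n\Big\},$$
and the probability of outcome $\omega$ is
$$p_1^{\sum_i I(\omega \in A_i)}\,(1-p_1)^{\,n-\sum_i I(\omega \in A_i)}\; p_2^{\sum_j I(\omega \in B_j)}\,(1-p_2)^{\,n-1-\sum_j I(\omega \in B_j)}.$$
Because they do not depend on the choice of the rule, the probabilities are
the same.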
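The claim can be checked numerically for small $n$. The sketch below computes the winning probability by recursion on the score under each convention and compares both with the extended-match expression. The function names and the rule labels `"alternate"` and `"winner"` are illustrative choices, not notation from the text.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

def win_prob(n, p1, p2, rule):
    """P(you reach n points first) when you serve first, under the given rule."""
    @lru_cache(maxsize=None)
    def f(a, b, you_serve):
        if a == n:
            return Fraction(1)
        if b == n:
            return Fraction(0)
        p = p1 if you_serve else p2           # chance that you win this point
        if rule == "alternate":               # rule (i): serves alternate
            after_win = after_loss = not you_serve
        else:                                 # rule (ii): the point's winner serves next
            after_win, after_loss = True, False
        return p * f(a + 1, b, after_win) + (1 - p) * f(a, b + 1, after_loss)
    return f(0, 0, True)

def extended(n, p1, p2):
    """P(at least n wins in n own serves plus n - 1 opponent serves)."""
    total = Fraction(0)
    for k in range(n + 1):                    # points won on your serves
        for l in range(n):                    # points won on the opponent's serves
            if k + l >= n:
                total += (comb(n, k) * p1**k * (1 - p1)**(n - k)
                          * comb(n - 1, l) * p2**l * (1 - p2)**(n - 1 - l))
    return total

p1, p2 = Fraction(3, 5), Fraction(1, 3)       # arbitrary illustrative values
for n in range(1, 6):
    print(n, win_prob(n, p1, p2, "alternate"), win_prob(n, p1, p2, "winner"))
```

Exact rational arithmetic is used so that the three quantities can be compared for equality rather than approximately.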


Remark 1.2.22 The $\omega$-notation was quite handy in this solution. We will
use it repeatedly in future problems.


Worked Example 1.2.23 The departmental photocopier has three parts 
A, B and C which can go wrong. The probability that A will fail during a
copying session is $10^{-5}$. The probability that B will fail is $10^{-1}$ if A fails and
$10^{-5}$ otherwise. The probability that C will fail is $10^{-1}$ if B fails and $10^{-5}$
otherwise. The ‘Call Engineer’ sign lights up if two or three parts fail. If only
two parts have failed I can repair the machine myself but if all three parts 
have failed my attempts will only make matters worse. If the ‘Call Engineer’ 
sign lights up and I am willing to run a risk of no greater than 1 percent of 
making matters worse, should I try to repair the machine, and why? 


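Solution The final outcomes (f = fails, w = works) are

AfBfCf, probability $10^{-5} \times 10^{-1} \times 10^{-1} = 10^{-7}$,
AfBwCf, probability $10^{-5} \times 9\cdot10^{-1} \times 10^{-5} = 9\cdot10^{-11}$,
AfBfCw, probability $10^{-5} \times 10^{-1} \times 9\cdot10^{-1} = 9\cdot10^{-7}$,
AwBfCf, probability $(1-10^{-5}) \times 10^{-5} \times 10^{-1} = (1-10^{-5})\cdot10^{-6}$.

So,
$$P(3 \text{ parts fail} \mid 2 \text{ or } 3 \text{ fail}) = \frac{P(3 \text{ fail})}{P(2 \text{ or } 3 \text{ fail})}
= \frac{P(A, B, C \text{ f})}{P(A, B \text{ f}, C \text{ w}) + P(A, C \text{ f}, B \text{ w}) + P(B, C \text{ f}, A \text{ w}) + P(A, B, C \text{ f})}$$
$$= \frac{10^{-7}}{9\cdot10^{-7} + 9\cdot10^{-11} + (1-10^{-5})\cdot10^{-6} + 10^{-7}}
\approx \frac{10^{-7}}{9\cdot10^{-7} + 10^{-6} + 10^{-7}} = \frac{1}{20} > \frac{1}{100}.$$
Thus, you should not attempt to mend the photocopier: the chances of
making things worse are 1/20.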
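The arithmetic can be reproduced exactly with rational numbers. This is a minimal sketch in which the dictionaries `P_B` and `P_C` are just a convenient encoding of the conditional failure probabilities stated in the problem.

```python
from fractions import Fraction

P_A = Fraction(1, 10**5)                                    # A fails
P_B = {True: Fraction(1, 10), False: Fraction(1, 10**5)}    # B fails, given whether A failed
P_C = {True: Fraction(1, 10), False: Fraction(1, 10**5)}    # C fails, given whether B failed

def outcome_prob(a, b, c):
    """Probability that the failure pattern of (A, B, C) is (a, b, c)."""
    pa = P_A if a else 1 - P_A
    pb = P_B[a] if b else 1 - P_B[a]
    pc = P_C[b] if c else 1 - P_C[b]
    return pa * pb * pc

patterns = [(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False)]
p_sign = sum(outcome_prob(*w) for w in patterns if sum(w) >= 2)   # sign lights up
risk = outcome_prob(True, True, True) / p_sign
print(risk, float(risk))
```

The exact value differs from 1/20 only in the fifth decimal place, so the approximation in the solution is harmless.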


Worked Example 1.2.24 I am playing tennis with my parents. The probability
that I win a game against mum (M) is $p$, and the probability that I win against
dad (D) is $q$, where $0 < q < p < 1$. We agreed to play three games, and their order
may be DMD (first I play with dad, then with mum, then again with dad)
or MDM. The results of all games are independent. In both cases find the
probabilities of the following events:

(a) I win at least one game, 
(b) I win at least two games, 


(c) I win at least two games in succession (i.e. 1 and 2, or 2 and 3, or 1, 2 
and 3), 


(d) I win exactly two games in succession (i.e. 1 and 2, or 2 and 3 but not 
1, 2 and 3), 


(e) I win exactly two games (i.e. 1 and 2, or 2 and 3, or 1 and 3 but not 1, 
2 and 3). 


Solution (a)
$$P_{DMD}(\text{win} \ge 1) = 1-(1-q)^2(1-p) < 1-(1-p)^2(1-q) = P_{MDM}(\text{win} \ge 1).$$
(b)
$$P_{DMD}(\text{win} \ge 2) = 2pq + q^2(1-2p) < 2pq + p^2(1-2q) = P_{MDM}(\text{win} \ge 2).$$
(c)
$$P_{DMD}(\text{win in succession} \ge 2) = 2pq - q^2p > 2pq - p^2q = P_{MDM}(\text{win in succession} \ge 2).$$
(d)
$$P_{DMD}(\text{win 2 in succession}) = 2pq - 2q^2p > 2pq - 2p^2q = P_{MDM}(\text{win 2 in succession}).$$
(e)
$$P_{DMD}(\text{win exactly 2}) = 2pq + q^2(1-3p), \quad P_{MDM}(\text{win exactly 2}) = 2pq + p^2(1-3q).$$
Under condition $p + q > 3pq$ we get
$$p^2(1-3q) - q^2(1-3p) = p^2 - q^2 - 3pq(p-q) = (p-q)(p+q-3pq) > 0.$$
This implies $P_{DMD}(\text{win exactly 2}) < P_{MDM}(\text{win exactly 2})$.
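The five formulas can be verified by summing over the eight win/loss patterns of the three games. The following sketch assumes only the independence stated in the problem; `event_probs` is an illustrative helper name.

```python
from fractions import Fraction
from itertools import product

def event_probs(order, p, q):
    """Probabilities of events (a)-(e) for a three-game order such as 'DMD'."""
    win = {"M": p, "D": q}
    res = dict.fromkeys("abcde", Fraction(0))
    for w in product((0, 1), repeat=3):              # w[k] = 1 if game k+1 is won
        pr = Fraction(1)
        for game, won in zip(order, w):
            pr *= win[game] if won else 1 - win[game]
        pair = (w[0] and w[1]) or (w[1] and w[2])    # two wins in a row somewhere
        if sum(w) >= 1:
            res["a"] += pr
        if sum(w) >= 2:
            res["b"] += pr
        if pair:
            res["c"] += pr
        if pair and sum(w) == 2:
            res["d"] += pr
        if sum(w) == 2:
            res["e"] += pr
    return res

p, q = Fraction(3, 5), Fraction(2, 5)                # illustrative values, q < p
dmd, mdm = event_probs("DMD", p, q), event_probs("MDM", p, q)
print(dmd)
print(mdm)
```

For these illustrative values $p + q = 1 > 3pq = 0.72$, so the condition used in part (e) holds.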


Worked Example 1.2.25 You have four coins in your parcel. One of 
them is special: it has heads on both sides. A coin selected at random shows 
heads in three successive trials. Find the conditional probability of obtaining 
heads in the fourth trial under the condition that the three previous trials 
showed heads. 


Solution Let the special coin be labelled 1, and the other coins labelled 2,
3, 4. The probability space $\Omega$ is the Cartesian product
$$\{1,2,3,4\} \times \{H,T\}^4,$$
i.e. the set $\{(i, S_1, S_2, S_3, S_4)\}$ where $i$ is the coin number and $S_j = H$ or
$T$, $j = 1,2,3,4$. We have
$$P(1,H,H,H,H) = \frac14, \quad P(i,H,H,H,H) = \frac14\left(\frac12\right)^4 = \frac1{64}, \quad i = 2,3,4.$$
Now we find the conditional probability
$$P(H,H,H,H \mid H,H,H) = \frac{P(H,H,H,H)}{P(H,H,H)}
= \frac{P(1,H,H,H,H) + \sum_{i=2}^{4} P(i,H,H,H,H)}{P(1,H,H,H) + \sum_{i=2}^{4} P(i,H,H,H)}
= \frac{\frac14 + \frac34\left(\frac12\right)^4}{\frac14 + \frac34\left(\frac12\right)^3} = \frac{19}{22}.$$
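A short exact computation confirms the ratio. This sketch encodes each coin by its probability of heads; the names `HEADS_PROB` and `prob_all_heads` are illustrative.

```python
from fractions import Fraction

# P(heads) for each coin: coin 1 is double-headed, coins 2-4 are fair.
HEADS_PROB = [Fraction(1), Fraction(1, 2), Fraction(1, 2), Fraction(1, 2)]

def prob_all_heads(k):
    """P(first k tosses are all heads) for a coin chosen uniformly at random."""
    return sum(Fraction(1, len(HEADS_PROB)) * h**k for h in HEADS_PROB)

posterior = prob_all_heads(4) / prob_all_heads(3)
print(posterior)
```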


Worked Example 1.2.26 Let $A_1, A_2, \ldots, A_n$ ($n \ge 2$) be events in a sample
space. For each of the following statements, either prove the statement
or provide a counterexample.

(i) If $B$ is an event and if, for each $k$, $\{B, A_k\}$ is a pair of independent
events, then $\{B, \bigcup_k A_k\}$ is also a pair of independent events.

(ii)
$$P\Big(\bigcap_{k=2}^{n} A_k \,\Big|\, A_1\Big) = \prod_{k=2}^{n} P\Big(A_k \,\Big|\, \bigcap_{r=1}^{k-1} A_r\Big), \quad\text{provided } P\Big(\bigcap_{k=1}^{n-1} A_k\Big) > 0.$$

(iii) If $\sum_{k=1}^{n} P(A_k) > n-1$ then $P\big(\bigcap_{k=1}^{n} A_k\big) > 0$.

(iv) If $\sum_{i<j} P(A_i \cap A_j) > \binom{n}{2} - 1$ then $P\big(\bigcap_{k=1}^{n} A_k\big) > 0$.


Solution (i) False. For a counterexample, consider a throw of two dice and
fix the following events:
$$A_1 = \{\text{first die odd}\}, \quad A_2 = \{\text{second die odd}\}, \quad B = \{\text{sum odd}\}.$$
Then $P(A_1) = P(A_2) = P(B) = 1/2$ and $P(A_1 \cup A_2) = 1 - P(A_1^c \cap A_2^c) = 3/4$.
Next,
$$P(A_1 \cap B) = \frac14 = P(A_1)P(B),$$
$$P(A_2 \cap B) = \frac14 = P(A_2)P(B),$$
indicating that individually, each of $(A_1, B)$ and $(A_2, B)$ is a pair of independent
events. However, as $B \subset (A_1 \cup A_2)$,
$$P(B \cap (A_1 \cup A_2)) = P(B) = \frac12 \ne \frac12 \times \frac34.$$


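(ii) True, because
$$\prod_{2 \le k \le n} P\Big(A_k \,\Big|\, \bigcap_{1 \le r \le k-1} A_r\Big)
= P(A_2|A_1)\,P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap \cdots \cap A_{n-1})$$
$$= \frac{P(A_1 \cap A_2)}{P(A_1)}\,\frac{P(A_1 \cap A_2 \cap A_3)}{P(A_1 \cap A_2)} \cdots \frac{P(A_1 \cap \cdots \cap A_{n-1} \cap A_n)}{P(A_1 \cap \cdots \cap A_{n-1})}
= \frac{P(A_1 \cap \cdots \cap A_n)}{P(A_1)} = P\Big(\bigcap_{k=2}^{n} A_k \,\Big|\, A_1\Big).$$
(iii) True. Indeed,
$$\Big(\bigcap_{1 \le k \le n} A_k\Big)^c = \bigcup_{1 \le k \le n} A_k^c$$
and
$$P\Big(\bigcup_{1 \le k \le n} A_k^c\Big) \le \sum_{1 \le k \le n} P(A_k^c) = \sum_{1 \le k \le n} [1 - P(A_k)] = n - \sum_{1 \le k \le n} P(A_k),$$
implying that
$$P\Big(\bigcap_{1 \le k \le n} A_k\Big) = 1 - P\Big(\bigcup_{1 \le k \le n} A_k^c\Big) \ge \sum_{1 \le k \le n} P(A_k) - (n-1) > 0.$$
(iv) True. Apply the previous part to the events $A_i \cap A_j$, $i < j$, the total number of which is
$\binom{n}{2}$.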

Worked Example 1.2.27 A normal deck of playing cards contains 
52 cards, four each with face values in the set $F = \{A,2,3,4,5,6,7,8,9,10,J,Q,K\}$.
Suppose the deck is well shuffled so that each arrangement is
equally likely. Write down the probability that the top and bottom cards 
have the same face value. 

Consider the following algorithm for shuffling: 

(S1): Permute the deck randomly so that each arrangement is equally 
likely. 

(S2): If the top and bottom cards do not have the same face value, toss a 
biased coin that comes up heads with probability p and go back to step S1 
if head turns up. Otherwise stop. 

All coin tosses and all permutations are assumed to be independent. When 
the algorithm stops, let X and Y denote the respective face values of the 
top and bottom cards and compute the probability that X = Y. Write down 
the probability that X = x for some x € F and the probability that Y = y 
for some y € F. What value of p will make X and Y independent random 
variables? Justify your answer. 


Solution Clearly, for the top card we have 52 choices. Then for the bottom
card we have 3 choices to satisfy the requirement. For the remaining 50 cards
there are 50! possibilities. The total number of card arrangements equals 52!.
So:
$$P(\text{top and bottom cards have the same face value}) = \frac{52 \cdot 3 \cdot 50!}{52!} = \frac{1}{17}.$$
Further, the probability that the top and bottom cards have different face
values equals 16/17. Therefore,
$$P(\text{top and bottom cards have different face values after the first shuffling, the same value after the second one}) = \frac{16}{17} \times p \times \frac{1}{17}.$$
This argument can be repeated:
$$P(\text{top and bottom cards have the same face value at the time the algorithm stops})
= \frac{1}{17} + \frac{16}{17}\,p\,\frac{1}{17} + \left(\frac{16}{17}\,p\right)^2 \frac{1}{17} + \cdots = \frac{1}{17}\cdot\frac{1}{1 - 16p/17}.$$
Finally, $P(X = x) = P(Y = y) = 1/\mathrm{card}\,F = 1/13$. By symmetry,
$$P(X = Y = x) = q \ \text{ for all } x \in F, \qquad P(X = x, Y = y) = r \ \text{ for all } x, y \in F,\ x \ne y.$$
For independence we need $q = r$. But
$$P(X = Y) = 13q, \quad P(X \ne Y) = 13 \times 12\,r \quad\text{and}\quad P(X = Y) + P(X \ne Y) = 1.$$
Condition $q = r$ implies that $q = 1/169$, so that
$$13q = \frac{1}{13}, \quad\text{i.e.}\quad \frac{1}{17}\cdot\frac{1}{1 - 16p/17} = \frac{1}{13}.$$
This yields $p = 1/4$.
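The value $p = 1/4$ can be confirmed with exact arithmetic. This is a sketch under the symmetry argument used in the solution; `prob_equal_at_stop` and `joint_probs` are illustrative names.

```python
from fractions import Fraction
from math import factorial

# Probability that top and bottom cards match after one uniform shuffle.
P_MATCH = Fraction(52 * 3 * factorial(50), factorial(52))

def prob_equal_at_stop(p):
    """P(X = Y) when the algorithm stops: geometric series with ratio (1 - P_MATCH) * p."""
    return P_MATCH / (1 - (1 - P_MATCH) * p)

def joint_probs(p):
    """Return (q, r): P(X = Y = x) and P(X = x, Y = y) for x != y, by symmetry."""
    same = prob_equal_at_stop(p)
    return same / 13, (1 - same) / (13 * 12)

q, r = joint_probs(Fraction(1, 4))
print(P_MATCH, q, r)
```

At $p = 1/4$ both joint probabilities equal $1/169 = (1/13)^2$, which is the independence condition.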


Worked Example 1.2.28 (a) Let $A_1$, $A_2$ and $A_3$ be three pair-wise
disjoint events such that the union $A_1 \cup A_2 \cup A_3$ is the full event and
$P(A_1), P(A_2), P(A_3) > 0$. Let $E$ be any event with $P(E) > 0$. Prove the
formula:
$$P(A_i|E) = \frac{P(A_i)P(E|A_i)}{\sum_{j=1,2,3} P(A_j)P(E|A_j)}.$$

(b) A Royal Navy speedboat has intercepted an abandoned cargo of pack- 
ets of the deadly narcotic spitamin. This sophisticated chemical can be fabri- 
cated only in three places in the world: a plant in Authoristan (A), a factory 
in Bolimbia (B) and the ultramodern laboratory on board a pirate submarine
Crash (C) cruising ocean waters. The investigators wish to determine
where this particular production comes from, but in the absence of prior 
knowledge they have to assume that each of the possibilities A, B and C is 
equally likely. 

It is known that a packet from A contains pure spitamin in 95 percent 
of cases and is contaminated in 5 percent of cases. For B the corresponding 
figures are 97 percent and 3 percent and for C they are 99 percent and 
1 percent. 

The analysis of the captured cargo showed that out of 10000 checked 
packets, 9800 packets contained the pure drug and the remaining 200 were 
contaminated. That is, 98 percent of the cargo is estimated to be pure spi- 
tamin. On the basis of this analysis, the captain of the speedboat reported 
his opinion that with probability roughly 0.5 the cargo was produced in B 
and with probability roughly 0.5 it was produced in C. 

Let us assume that the number of contaminated packets follows the binomial
distribution Bin($\delta/100$, 10000) where $\delta$ equals 5 for A, 3 for B and 1
for C. Prove that the captain’s opinion is wrong: there is an overwhelming
chance that the cargo comes from B.

Hint: Let $E$ be the event that 200 out of 10000 packets are contaminated.
Compare the ratios of the conditional probabilities $P(E|A)$, $P(E|B)$ and
$P(E|C)$. You may find it helpful that $\ln 3 = 1.09861$ and $\ln 5 = 1.60944$.
You may also take $\ln(1 - \delta/100) = -\delta/100$.


Solution (a) Write:
$$P(A_i|E) = \frac{P(E \cap A_i)}{P(E)} = \frac{P(A_i)P(E|A_i)}{\sum_j P(A_j)P(E|A_j)}.$$
Here one uses (i) the definition:
$$P(A|E) = \frac{P(A \cap E)}{P(E)} = \frac{P(A)P(E|A)}{P(E)}$$
(the Bayes formula), and (ii) the equation
$$P(E) = \sum_j P(A_j)P(E|A_j),$$
for events $E$ and $A_j$ as above.


(b) In the example, events A, B, C are pair-wise disjoint and $P(A) =
P(B) = P(C) = 1/3$. Then, by (a), the conditional probability $P(A|E)$
equals
$$\frac{P(E|A)/3}{[P(E|A) + P(E|B) + P(E|C)]/3} = \frac{1}{1 + [P(E|B) + P(E|C)]/P(E|A)},$$
and similarly for $P(B|E)$, $P(C|E)$. Next, $E$ is the event that 200 out of 10000
packets turned out to be contaminated. Then
$$P(E|A) \propto 5^{200}(.95)^{9800}, \quad P(E|B) \propto 3^{200}(.97)^{9800}, \quad P(E|C) \propto (.99)^{9800},$$
the proportionality coefficient being $\dfrac{1}{100^{200}}\dbinom{10000}{200}$.
So,
$$P(A|E) = \left(1 + \frac{3^{200}(.97)^{9800} + (.99)^{9800}}{5^{200}(.95)^{9800}}\right)^{-1}
= \left(1 + \frac{e^{200\ln 3 - 98 \times 3} + e^{-98}}{e^{200\ln 5 - 98 \times 5}}\right)^{-1}$$
$$= \left[1 + \frac{e^{-74} + e^{-98}}{e^{-168}}\right]^{-1} = \left(1 + e^{94} + e^{70}\right)^{-1}: \text{ very small},$$
since, with the given values of the log:
$$200\ln 5 - 98 \times 5 = -490 + 322 = -168,$$
and
$$200\ln 3 - 98 \times 3 = -294 + 220 = -74.$$
We see that $e^{-168} \ll e^{-98} \ll e^{-74}$, i.e. the term $P(E|B)$ contributes
overwhelmingly. Hence, with probability $\approx 1$, the cargo comes from B.
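The posterior probabilities can also be computed numerically. The sketch below uses exact logarithms rather than the approximation $\ln(1-\delta/100) \approx -\delta/100$ suggested in the hint, so the exponents it produces differ somewhat from the hand calculation, although the conclusion is the same. The names `RATES`, `log_likelihood` and `posterior` are illustrative.

```python
import math

N, K = 10_000, 200                            # packets checked, packets contaminated
RATES = {"A": 0.05, "B": 0.03, "C": 0.01}     # contamination probability per source

def log_likelihood(rate):
    """log P(E | source) up to the common binomial coefficient, which cancels."""
    return K * math.log(rate) + (N - K) * math.log(1 - rate)

def posterior(rates):
    """Posterior over sources under a uniform prior, computed stably in log space."""
    logs = {s: log_likelihood(r) for s, r in rates.items()}
    top = max(logs.values())
    weights = {s: math.exp(v - top) for s, v in logs.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

post = posterior(RATES)
print(post)
```

Subtracting the largest log-likelihood before exponentiating avoids underflow, since each likelihood on its own is far below the smallest positive floating-point number.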


1.3 The exclusion-inclusion formula. The ballot problem


Natural selection is a mechanism for generating an exceedingly high degree 
of improbability. 
R.A. Fisher (1890-1962), British statistician. 


The exclusion-inclusion formula helps us calculate the probability $P(A)$,
where $A = A_1 \cup A_2 \cup \cdots \cup A_n$ is the union of a given collection of events
$A_1,\ldots,A_n$. We know (see Section 1.2) that if events are pair-wise disjoint,
$$P(A) = \sum_{1 \le i \le n} P(A_i).$$
In general, the formula is more complicated:
$$P(A) = P(A_1) + \cdots + P(A_n)
- P(A_1 \cap A_2) - P(A_1 \cap A_3) - \cdots - P(A_{n-1} \cap A_n)$$
$$+ P(A_1 \cap A_2 \cap A_3) + \cdots + P(A_{n-2} \cap A_{n-1} \cap A_n)
+ \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n)$$
$$= \sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le n} P\Big(\bigcap_{j=1}^{k} A_{i_j}\Big). \qquad (1.3.1)$$

The proof is reduced to a counting argument: on the left-hand side (LHS)
we count each outcome from $\bigcup_{1 \le i \le n} A_i$ once. But if we take the sum
$\sum_{1 \le i \le n} P(A_i)$ then those outcomes that enter more than one event among
$A_1,\ldots,A_n$ will be counted more than once. Now if we subtract the sum
$\sum_{1 \le i < j \le n} P(A_i \cap A_j)$ we will count precisely once the outcomes that
enter exactly two events among $A_1,\ldots,A_n$. We will still be in trouble with the
outcomes that enter three or more events. Therefore we have to add
$\sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k)$. And so on.

A formal proof can be carried out by induction on $n$. It is convenient to begin
the induction with $n = 2$ (for $n = 1$ the formula is trivial). For two events
$A$ and $B$, $A \cup B$ coincides with $(A \setminus (A \cap B)) \cup (B \setminus (A \cap B)) \cup (A \cap B)$, the
union of non-intersecting events. Hence, $P(A \cup B)$ can be written as
$$P(A \setminus (A \cap B)) + P(B \setminus (A \cap B)) + P(A \cap B)$$
$$= P(A) - P(A \cap B) + P(B) - P(A \cap B) + P(A \cap B)$$
$$= P(A) + P(B) - P(A \cap B).$$

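The induction hypothesis is that the formula holds for any collection of $n$
or fewer events. Then for any collection $A_1,\ldots,A_{n+1}$ of $n+1$ events, the
probability $P\big(\bigcup_{i=1}^{n+1} A_i\big)$ equals:
$$P\Big(\Big(\bigcup_{i=1}^{n} A_i\Big) \cup A_{n+1}\Big)
= P\Big(\bigcup_{i=1}^{n} A_i\Big) + P(A_{n+1}) - P\Big(\Big(\bigcup_{i=1}^{n} A_i\Big) \cap A_{n+1}\Big)$$
$$= \sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le n} P\Big(\bigcap_{j=1}^{k} A_{i_j}\Big)
+ P(A_{n+1}) - P\Big(\bigcup_{i=1}^{n} (A_i \cap A_{n+1})\Big).$$
For the last term we have, again by the induction hypothesis:
$$-P\Big(\bigcup_{i=1}^{n} (A_i \cap A_{n+1})\Big)
= -\sum_{k=1}^{n} (-1)^{k-1} \sum_{1 \le i_1 < \cdots < i_k \le n} P\Big(\bigcap_{j=1}^{k} (A_{i_j} \cap A_{n+1})\Big)$$
$$= \sum_{k=1}^{n} (-1)^{k} \sum_{1 \le i_1 < \cdots < i_k \le n} P\Big(\Big(\bigcap_{j=1}^{k} A_{i_j}\Big) \cap A_{n+1}\Big).$$
We see that the whole sum in the expansion for $P\big(\bigcup_{i=1}^{n+1} A_i\big)$ includes all
possible terms identified on the RHS of formula (1.3.1) for $n+1$, with correct
signs. This completes the induction.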
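Formula (1.3.1) is easy to test on a finite sample space with equally likely outcomes. The sketch below draws a few arbitrary events at random (the seed, the size of the sample space and the number of events are illustrative choices) and compares the two sides.

```python
from fractions import Fraction
from itertools import combinations
import random

def union_prob_direct(events, omega_size):
    """P(A_1 u ... u A_n) on a uniform sample space {0, ..., omega_size - 1}."""
    return Fraction(len(set().union(*events)), omega_size)

def union_prob_incl_excl(events, omega_size):
    """Right-hand side of (1.3.1): alternating sum over all k-fold intersections."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = (-1) ** (k - 1)
        for subset in combinations(events, k):
            total += sign * Fraction(len(set.intersection(*subset)), omega_size)
    return total

rng = random.Random(0)                        # fixed seed, arbitrary choice
OMEGA = 30
events = [set(rng.sample(range(OMEGA), rng.randint(5, 15))) for _ in range(5)]
print(union_prob_direct(events, OMEGA), union_prob_incl_excl(events, OMEGA))
```

The right-hand side has $2^n - 1$ terms, which is why the formula is most useful when symmetry collapses many of them, as in the examples that follow.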

An alternative proof (which is instructive as it shows relations between 
various concepts of probability theory) will be given in the next section, 
after we introduce random variables and expectations. 

The exclusion-inclusion formula is particularly efficient under assumptions
of independence and symmetry. It also provides interesting asymptotic
insight into various probabilities.


Example 1.3.1 An example of using the exclusion-inclusion formula is
the following matching problem. An absent-minded person (some authors
prefer talking about a secretary) has to put $n$ personal letters in $n$ addressed
envelopes, and he does it at random. What is the probability $p_{m,n}$ that
exactly $m$ letters will be put correctly in their envelopes? Verify the limit
$$\lim_{n \to \infty} p_{m,n} = \frac{1}{e\,m!}.$$
The solution is as follows. The set of outcomes consists of $n!$ possible
matchings of the letters to envelopes. Let $A_k = \{\text{letter } k \text{ in correct
envelope}\}$. Then
$$P(A_{i_1} \cap \cdots \cap A_{i_r}) = \frac{(n-r)!}{n!},$$
and so
$$\sum_{i_1 < i_2 < \cdots < i_r} P(A_{i_1} \cap \cdots \cap A_{i_r}) = \binom{n}{r}\frac{(n-r)!}{n!} = \frac{1}{r!}.$$
Thus,
$$P(\text{at least one letter in the correct envelope}) = P\Big(\bigcup_{i=1}^{n} A_i\Big) = 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n-1}\frac{1}{n!},$$
and
$$P(\text{no letter in the correct envelope}) = 1 - P\Big(\bigcup_{i=1}^{n} A_i\Big) = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + \cdots + (-1)^{n}\frac{1}{n!},$$
which tends to $e^{-1}$ as $n \to \infty$. The number of outcomes in the event {no
letter put in the correct envelope} equals $n!\,p_{0,n}$, and so
$$p_{m,n} = \binom{n}{m}\frac{(n-m)!\,p_{0,n-m}}{n!}.$$
Therefore,
$$p_{m,n} = \frac{1}{m!}\,p_{0,n-m} = \frac{1}{m!}\left(1 - 1 + \frac{1}{2!} - \cdots + (-1)^{n-m}\frac{1}{(n-m)!}\right),$$
which approaches $e^{-1}/m!$ as $n \to \infty$.
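For small $n$ the formula for $p_{m,n}$ can be compared with a count of permutations having exactly $m$ fixed points. A minimal sketch, with illustrative function names `match_formula` and `match_brute`:

```python
from fractions import Fraction
from itertools import permutations
from math import exp, factorial

def match_formula(m, n):
    """p_{m,n} = (1/m!) * sum_{k=0}^{n-m} (-1)^k / k!."""
    return Fraction(1, factorial(m)) * sum(
        Fraction((-1) ** k, factorial(k)) for k in range(n - m + 1))

def match_brute(m, n):
    """Fraction of the n! permutations with exactly m fixed points."""
    hits = sum(1 for perm in permutations(range(n))
               if sum(i == x for i, x in enumerate(perm)) == m)
    return Fraction(hits, factorial(n))

for m in range(4):
    print(m, float(match_formula(m, 20)), exp(-1) / factorial(m))
```

The convergence to $e^{-1}/m!$ is very fast because the remainder of the alternating series is bounded by $1/(n-m+1)!$.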


Worked Example 1.3.2 A total of $n$ male psychologists remembered
to attend a meeting about absent-mindedness. After the meeting, none could
recognise his own hat so they took hats at random. Furthermore, each was
liable, with probability $p$ and independently of the others, to lose the hat on
the way home. Assuming, optimistically, that all arrived home, find the
probability that none had his own hat with him, and deduce that it is
approximately $e^{-(1-p)}$.


Solution Set $A_j = \{\text{psychologist } j \text{ had his own hat}\}$, then the event $A =
\{\text{none had his own hat}\}$ is the complement of $\bigcup_{1 \le j \le n} A_j$. By the
exclusion-inclusion formula and the symmetry:
$$P(A) = 1 - nP(A_1) + \frac{n(n-1)}{2}P(A_1 \cap A_2) - \cdots + (-1)^n P(A_1 \cap \cdots \cap A_n).$$
Next,
$$P(A_1) = \frac{1-p}{n}, \quad P(A_1 \cap A_2) = \frac{(1-p)^2}{n(n-1)}, \quad \ldots, \quad P(A_1 \cap \cdots \cap A_n) = \frac{(1-p)^n}{n!}.$$
Then
$$P(A) = 1 - (1-p) + \frac{(1-p)^2}{2!} - \cdots + (-1)^n\frac{(1-p)^n}{n!},$$
which is a partial sum of the series for $e^{-(1-p)}$.
Worked Example 1.3.3 (a) Let $A_1,\ldots,A_n$ be such events that the
intersection of any $k$ of them is empty: $\bigcap_{j=1}^{k} A_{i_j} = \emptyset$ for any
$1 \le i_1 < \cdots < i_k \le n$. Then
$$\frac{1}{k-1}\sum_{i=1}^{n} P(A_i) \le P(A_1 \cup \cdots \cup A_n) \le \sum_{i=1}^{n} P(A_i).$$
(b) It is certain that at least one, but no more than two of the events
$A_1,\ldots,A_n$ occur. Given that $P(A_i) = p_1$ for all $i$, and $P(A_i \cap A_j) = p_2$ for
all $i, j$ ($i \ne j$), show that
$$1 = np_1 - \frac{n(n-1)}{2}\,p_2.$$
Deduce that $p_1 \ge 1/n$ and $p_2 \le 2/n$.


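Solution (a) The lower bound follows by taking expectations of both sides of the
inequality
$$\sum_{i=1}^{n} I(A_i) \le (k-1)\,I(A_1 \cup \cdots \cup A_n),$$
which holds because every outcome belongs to at most $k-1$ of the events. The
upper bound is the usual sub-additivity of probability.

(b) As $P(A_{i_1} \cap \cdots \cap A_{i_k}) = 0$ for $k > 2$, the exclusion-inclusion formula
gives $1 = np_1 - n(n-1)p_2/2$. Rearranging:
$$p_1 = \frac1n + \frac{n-1}{2}\,p_2 \ge \frac1n,$$
as $p_2 \ge 0$. Finally,
$$p_2 = \frac{2(np_1 - 1)}{n(n-1)} \le \frac{2(n-1)}{n(n-1)} = \frac2n,$$
as $p_1 \le 1$.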

Worked Example 1.3.4 A large and cheerful crowd of n junior wizards 
leave their staffs in the Porter’s Lodge on the way to a long night in the 
Mended Drum. On returning, each collects a staff at random from a pile, 
returns to his room and attempts to cast a spell against hangovers. If a 
junior wizard attempts this spell with his own staff, there is a probability p 
that he will turn into a bullfrog. If he attempts it with someone else’s staff, 
he is certain to turn into a bullfrog. Show that the probability that in the
morning we will find $n$ very surprised bullfrogs is approximately $e^{p-1}$.
Solution Set $A_i\,(= A_i(n)) = \{\text{wizard } i \text{ gets his own staff}\}$. Then for all
$r = 1,\ldots,n$ and $1 \le i_1 < \cdots < i_r \le n$:
$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_r}) = \frac{(n-r)!}{n!},$$
as there are $(n-r)!$ ways of distributing the remaining staffs. So, by the
exclusion-inclusion formula, the probability $P\big(\bigcup_{i=1}^{n} A_i\big)$ is equal to
$$\sum_{r=1}^{n} (-1)^{r-1}\binom{n}{r}\frac{(n-r)!}{n!} = \sum_{r=1}^{n} (-1)^{r-1}\frac{1}{r!} = 1 - \sum_{r=0}^{n} (-1)^{r}\frac{1}{r!},$$
which tends to $1 - e^{-1}$ as $n \to \infty$. We can also consider similar events $A_i(k)$
defined for a given subset of $k$ wizards, $1 \le k \le n$.
Further, if $P_r(k) = P(\text{precisely } r \text{ out of given } k \text{ get their own staff})$ then
$$P_r(k) = \binom{k}{r}\frac{(k-r)!}{k!}\,P_0(k-r) = \frac{1}{r!}\,P_0(k-r), \quad 0 \le r \le k.$$
Also:
$$P_0(k-r) = 1 - P\Big(\bigcup_{i=1}^{k-r} A_i(k-r)\Big) = \sum_{i=0}^{k-r} (-1)^{i}\frac{1}{i!} \to e^{-1}$$
as $k \to \infty$. So, $P_r(k) \approx e^{-1}/r!$ for $k$ sufficiently large. Finally
$$P(\text{all turn into bullfrogs}) = \sum_{r=0}^{n} P_r(n)\,p^r \approx \sum_{r=0}^{n} \frac{e^{-1}p^r}{r!} \approx e^{-1}e^{p} = e^{p-1}$$
for $n$ large enough.
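The quality of the approximation can be inspected numerically by evaluating the finite sum $\sum_r P_r(n)p^r$ exactly. The sketch below follows the formulas of the solution; the function names and the sample value $p = 1/3$ are illustrative.

```python
from fractions import Fraction
from math import exp, factorial

def p0(k):
    """P_0(k): none of k wizards gets his own staff."""
    return sum(Fraction((-1) ** i, factorial(i)) for i in range(k + 1))

def p_r(r, n):
    """P_r(n) = P_0(n - r) / r!: exactly r of n wizards get their own staffs."""
    return p0(n - r) / factorial(r)

def all_bullfrogs(n, p):
    """Sum over r of P_r(n) * p**r, the probability that all n become bullfrogs."""
    return sum(p_r(r, n) * p**r for r in range(n + 1))

p = Fraction(1, 3)                            # illustrative value
for n in (2, 5, 10, 15):
    print(n, float(all_bullfrogs(n, p)), exp(float(p) - 1))
```

Already for moderate $n$ the exact sum and $e^{p-1}$ agree to many decimal places.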


Remark 1.3.5 To formally prove the convergence $\sum_{r=0}^{n} P_r(n)p^r \to e^{p-1}$
one needs an assertion guaranteeing that the term-wise convergence (in our
case $P_r(n)p^r \to e^{-1}p^r/r!$ for all $r$) implies the convergence of the sum of the
series. For example, the following theorem will do:

If $a_n(m) \to a_n$ for all $n$ as $m \to \infty$ and $|a_n(m)| \le b_n$, where $\sum_n b_n < \infty$,
then the sum $S(m) = \sum_n a_n(m) \to S = \sum_n a_n$.

In fact, $P_r(n)p^r = P_0(n-r)p^r/r! \le p^r/r!$ $(= b_r)$, and the series $\sum_{r \ge 0} p^r/r!$
converges to $e^p$ for all $p$ (i.e. $\sum_r b_r < \infty$).

The remaining part of this section addresses the so-called ballot problem. 
Its original formulation is: a community of voters contains m conservatives 
and n anarchists voting for their respective candidates, where m > n. What 
is the probability that in the process of counting the secret ballot papers the 
conservative candidate will never be behind the anarchist? This question has 
emerged in many situations (but, strangely, not in the Cambridge University
IA Mathematical Tripos papers so far). However, at the University of
Wales—Swansea it has been actively discussed, in the slightly modified form 
presented below. 


We start with a particular case m = n. A series of 2n drinks is on offer, n of 
them are gin and n tonic. In a popular local game, a blindfolded participant 
drinks all 2n glasses, one at a time, selected at random. The participant is 
declared a winner if the volume of gin drunk is never more than that of 
tonic. We will check that this occurs with probability 1/(n + 1). 


Consider a random walk on the set $\{-n, -n+1, \ldots, n\}$ where a particle
moves one step up if a glass of tonic was selected and one down if it was 
gin. The walk begins at the origin (no drink consumed) and after 2n steps 
always comes back to it (the number of gins = the number of tonics). On 
Figure 1.2 that includes time, every path X(t) of the walk begins at (0,0) 
and ends at (2n,0) and each time jumps up and to the right or down and 
to the right. We look for the probability that the path remains above the 
line X = -1. 

The total number of paths from (0,0) to (2n, 0) is (2n)!/n!n!. The number 
of paths staying above the line is the same as the total number of paths from 
(1, 1) to (2n, 0) less the total number of paths from (1, —3) to (2n,0). In fact, 
the first step from (0,0) must be up. Next, if a path (0,0) to (2n,0) touches 
or crosses line X = —1, then we can reflect its initial bit and obtain a path 
from (1,—3) to (2n,0). This is sometimes called the reflection principle. 


Figure 1.2
Hence, the probability of winning is
$$\left[\frac{(2n-1)!}{n!\,(n-1)!} - \frac{(2n-1)!}{(n+1)!\,(n-2)!}\right] \Big/ \frac{(2n)!}{n!\,n!} = \frac{1}{n+1}.$$
Now assume that the number $m$ of tonics is $\ge n$, the number of gins. As
before, winning the game means that at each time the number of consumed
tonics is not less than that of gins. Then the total number of paths from $(0,0)$
to $(m+n, m-n)$ equals $(m+n)!/(m!\,n!)$. Again, the first step of a winning
path is always up. The total number of paths from $(1,1)$ to $(m+n, m-n)$
equals $(m+n-1)!/((m-1)!\,n!)$. Using the reflection principle, we see that
the number of losing paths equals the total number of paths from $(1,-3)$ to
$(m+n, m-n)$, which is $(m+n-1)!/((m+1)!\,(n-2)!)$. Finally, the winning
probability is
$$\left[\frac{(m+n-1)!}{(m-1)!\,n!} - \frac{(m+n-1)!}{(m+1)!\,(n-2)!}\right] \Big/ \frac{(m+n)!}{m!\,n!} = \frac{m-n+1}{m+1}.$$
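The closed form $(m-n+1)/(m+1)$ can be checked against a direct enumeration of all orderings of $m$ tonics and $n$ gins. A minimal sketch with illustrative function names:

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def win_prob_brute(m, n):
    """P(gins never outnumber tonics) over all orderings of m tonics and n gins."""
    wins = 0
    for gin_positions in combinations(range(m + n), n):
        gin_set = set(gin_positions)
        level, ok = 0, True
        for t in range(m + n):
            level += -1 if t in gin_set else 1    # tonic: step up, gin: step down
            if level < 0:                         # path has dropped below zero
                ok = False
                break
        wins += ok
    return Fraction(wins, comb(m + n, n))

def win_prob_formula(m, n):
    """Closed form obtained from the reflection principle."""
    return Fraction(m - n + 1, m + 1)

print(win_prob_brute(5, 3), win_prob_formula(5, 3))
```

With $m = n$ the formula reduces to $1/(n+1)$, the value found in the first part of the argument.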

We apply these results to the following. 
Worked Example 1.3.6 Suppose n married couples have to cross from 
the left to the right bank of a river via a narrow bridge, one by one. They 
decided that, at any time, the number of men on the left bank should be no 


less than that of women; apart from this the order can be arbitrary. Find 
the probability that every man will cross the river after his own wife. 


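Solution Set
$$A = \{\text{every man crosses after his own wife}\},$$
$$B = \{\text{at all times, the number of men on the left bank} \ge \text{that of women}\}.$$
Then $A \subseteq B$, $P(A) = 2^{-n}$ (within each couple either order is equally likely,
independently of the other couples) and $P(B) = 1/(n+1)$ by the result above, so
$$P(A|B) = P(A \cap B)/P(B) = P(A)/P(B) = \frac{n+1}{2^n}.$$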
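For small $n$ the answer $(n+1)/2^n$ can be confirmed by listing all $(2n)!$ crossing orders, assuming, as the solution does, that all orders are equally likely. The encoding of people as pairs `(couple, sex)` is an illustrative choice.

```python
from fractions import Fraction
from itertools import permutations

def conditional_prob(n):
    """P(A | B) by enumerating all crossing orders of n couples."""
    # Person (c, 'M') is the husband and (c, 'W') the wife of couple c.
    people = [(c, s) for c in range(n) for s in "MW"]
    count_b = count_ab = 0
    for order in permutations(people):
        men_left, women_left, in_b = n, n, True
        for _, sex in order:
            if sex == "M":
                men_left -= 1
            else:
                women_left -= 1
            if men_left < women_left:             # condition B is violated
                in_b = False
                break
        if not in_b:
            continue
        count_b += 1
        position = {person: t for t, person in enumerate(order)}
        if all(position[(c, "M")] > position[(c, "W")] for c in range(n)):
            count_ab += 1
    return Fraction(count_ab, count_b)

for n in range(1, 5):
    print(n, conditional_prob(n), Fraction(n + 1, 2**n))
```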


We conclude this section with another story about Kolmogorov. He be- 
gan his academic studies as a historian, and at the age of 19 finished a 
work focused on the principles of distribution and taxation of arable land in 
fifteenth- and sixteenth-century Novgorod, an ancient Russian state (a prin- 
cipality and a republic at different periods of its existence, fully or partially 
independent, until it was annexed by Moscow in 1478). In this work, he used 
mathematical arguments to answer the following question: was it (i) a vil- 
lage that was taxed in the first place, and then the tax was divided between 
households, or (ii) the other way around, where it was a household that was
originally taxed, and then the sum represented the total to be paid by the 
village? The sources were ancient cadastres and other official manuscripts 
of the period. Because the totals received from the villages were always an 
integer number of (changing) monetary units, Kolmogorov proved that it 
was rule (ii) that was adopted. 

Kolmogorov reported his findings at a history seminar at Moscow Uni- 
versity in 1922. However, the head of the seminar, a well-known professor 
(a street in Moscow was later named after him) commented that the con- 
clusions of his young colleague could not be considered final, because ‘in 
history, every statement must be supported by several proofs’. Kolmogorov 
then decided to move to a discipline where ‘a single proof would be sufficient 
for a statement to be considered correct’, i.e. mathematics. 


1.4 Random variables. Expectation and conditional 
expectation. Joint distributions 


He who has heard the same thing told by 12000 eye-witnesses has only 
12000 probabilities, which are equal to one strong probability, which is far 
from certain. 

F.M.A. Voltaire (1694-1778), French philosopher. 


This section is considerably longer than the previous ones: the wealth of 
problems generated here is such that getting through it allows a student to 
secure in principle at least the first third of the Cambridge University IA 
Probability course. 

The definition of a random variable (RV) is that it is a function $X$ on the
total set of outcomes $\Omega$, usually with real, sometimes with complex values,
$X(\omega)$, $\omega \in \Omega$ (in the complex case we may consider a pair formed by the
real and imaginary parts of $X$). A simple (and important) example of an
RV is the indicator function of an event:
$$I(\omega \in A) = \begin{cases} 1, & \text{if } \omega \in A,\\ 0, & \text{if } \omega \notin A.\end{cases} \qquad (1.4.1)$$
It is obvious that the product $I(\omega \in A_1)I(\omega \in A_2)$ equals 1 if and only if
$\omega \in A_1 \cap A_2$, i.e.
$$I(\omega \in A_1)I(\omega \in A_2) = I(\omega \in A_1 \cap A_2).$$
On the other hand,
$$I(\omega \in A_1 \cup A_2) = I(\omega \in A_1) + I(\omega \in A_2) - I(\omega \in A_1)I(\omega \in A_2).$$


In the case of finitely many outcomes, every RV is a finite linear combination 
of indicator functions; if there are countably many outcomes, then infinite 
combinations will be needed. 

The expected value (or the expectation, or the mean value, or simply the
mean) of an RV $X$ taking values $x_1,\ldots,x_m$ with probabilities $p_1,\ldots,p_m$ is
defined as the sum
$$EX = \sum_{1 \le i \le m} p_i x_i = \sum_{1 \le i \le m} x_i P(X = x_i); \qquad (1.4.2)$$
in the $\omega$-notation:
$$EX = \sum_{\omega \in \Omega} p(\omega)X(\omega). \qquad (1.4.3)$$
If $X \equiv b$ is a constant RV, then $EX = b$.
This definition is meaningful also for RVs taking countably many values
$x_1, x_2, \ldots$ with probabilities $p_1, p_2, \ldots$:
$$EX = \sum_i p_i x_i,$$
provided that the series converges absolutely: $\sum_i p_i|x_i| < \infty$. If $\sum_i p_i|x_i| =
\infty$, one says that $X$ does not have a finite expected value.


In many applications, it is helpful to treat the expected value $EX$ as the
position of the centre of mass for the system of massive points $x_1, x_2, \ldots$
with masses $p_1, p_2, \ldots$.

The first (and a very useful) observation is that the expected value of the
indicator $I_A(\omega) = I(\omega \in A)$ of an event equals the probability:
$$EI_A = \sum_{\omega \in \Omega} p(\omega)I(\omega \in A) = \sum_{\omega \in A} p(\omega) = P(A). \qquad (1.4.4)$$
Furthermore, if RVs $X$, $Y$ have $X \le Y$, then $EX \le EY$.
Next, the expectation of a linear combination of RVs equals the linear
combination of expectations. The shortest proof is in the $\omega$-notation:
$$E(c_1X_1 + c_2X_2) = \sum_{\omega} p(\omega)\,[c_1X_1(\omega) + c_2X_2(\omega)]
= c_1\sum_{\omega} p(\omega)X_1(\omega) + c_2\sum_{\omega} p(\omega)X_2(\omega) = c_1EX_1 + c_2EX_2.$$



This fact (called the linearity of the expectation) can be easily extended to
$n$ summands:
$$E\Big(\sum_{1 \le k \le n} c_kX_k\Big) = \sum_{1 \le k \le n} c_kEX_k;$$
in particular if $EX_k = \mu$ for every $k$, then $E\big(\sum_{1 \le k \le n} X_k\big) = \mu n$. A similar
property also holds for an infinite sequence of RVs $X_1, X_2, \ldots$:
$$E\Big(\sum_k c_kX_k\Big) = \sum_k c_kEX_k, \qquad (1.4.5)$$
provided that the series on the RHS converges absolutely: $\sum_k |c_kEX_k| < \infty$.
(The precise statement is that if $\sum_k |c_k|E|X_k| < \infty$, then the series
$\sum_k c_kX_k$ defines an RV with a finite mean equal to the sum $\sum_k c_kEX_k$.)
Also, for a given function $g: \mathbb{R} \to \mathbb{R}$,
$$Eg(X) = \sum_{\omega \in \Omega} p(\omega)\,g(X(\omega)) = \sum_i p_i\,g(x_i), \qquad (1.4.6)$$
provided that the sum $\sum_i p_i|g(x_i)| < \infty$. In fact, writing $Y = g(X)$ and
denoting the values of $Y$ by $y_1, y_2, \ldots$, we have
$$EY = \sum_j y_j P(g(X) = y_j) = \sum_j y_j \sum_i I(g(x_i) = y_j)P(X = x_i),$$
which is simply $\sum_i p_i\,g(x_i)$ under the condition that $\sum_i p_i|g(x_i)| < \infty$, as
we then can do summation by grouping terms.
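Formula (1.4.6) can be illustrated on a small discrete distribution by computing $Eg(X)$ in both ways: first through the distribution of $Y = g(X)$, then directly as $\sum_i p_i g(x_i)$. The distribution and the function $g(x) = x^2$ below are hypothetical examples, not taken from the text.

```python
from fractions import Fraction
from collections import defaultdict

# Hypothetical example: X is uniform on {-2, -1, 0, 1, 2} and g(x) = x**2.
dist_x = {x: Fraction(1, 5) for x in range(-2, 3)}

def expectation(dist):
    """E X = sum_i p_i x_i for a distribution given as {value: probability}."""
    return sum(p * x for x, p in dist.items())

def pushforward(dist, g):
    """Distribution of Y = g(X): group the values x_i with the same g(x_i)."""
    out = defaultdict(Fraction)
    for x, p in dist.items():
        out[g(x)] += p
    return dict(out)

def square(x):
    return x * x

lhs = expectation(pushforward(dist_x, square))        # sum_j y_j P(g(X) = y_j)
rhs = sum(p * square(x) for x, p in dist_x.items())   # sum_i p_i g(x_i)
print(lhs, rhs)
```

The grouping step in `pushforward` is exactly the rearrangement of terms used in the proof.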


Remark 1.4.1 Formula (1.4.6) is a discrete version of what is known as 
the Law of the Unconscious Statistician. See equation (2.2.12) below. 


Given two RVs, $X$ and $Y$, with values $X(\omega)$ and $Y(\omega)$, we can consider
events $\{\omega : X(\omega) = x_i, Y(\omega) = y_j\}$ (shortly, $\{X = x_i, Y = y_j\}$) for any pair
of their values $x_i$, $y_j$. The probabilities $P(X = x_i, Y = y_j)$ will specify the
joint distribution of the pair $X$, $Y$. The ‘marginal’ probabilities $P(X = x_i)$
and $P(Y = y_j)$ are obtained by summation over all possible values of the
complementary RV:
$$P(X = x_i) = \sum_{y_j} P(X = x_i, Y = y_j), \qquad P(Y = y_j) = \sum_{x_i} P(X = x_i, Y = y_j). \qquad (1.4.7)$$
We can also consider the conditional probabilities
$$P(X = x_i \mid Y = y_j) = \frac{P(X = x_i, Y = y_j)}{P(Y = y_j)}. \qquad (1.4.8)$$
Similar concepts are applicable in the case of several RVs $X_1,\ldots,X_n$.
For example, for the sum $X + Y$ of RVs $X$ and $Y$ with joint probabilities
$P(X = x_i, Y = y_j)$:
$$P(X + Y = u) = \sum_{x_i, y_j:\ x_i + y_j = u} P(X = x_i, Y = y_j)
= \sum_x P(X = x, Y = u - x) = \sum_y P(X = u - y, Y = y). \qquad (1.4.9)$$
Similarly, for the product $XY$:
$$P(XY = u) = \sum_{x_i, y_j:\ x_i y_j = u} P(X = x_i, Y = y_j)
= \sum_x P(X = x, Y = u/x) = \sum_y P(X = u/y, Y = y). \qquad (1.4.10)$$


A powerful tool is the formula of conditional expectation: if X and N are 
two RVs then 


EX =E[E(X|N)]. (1.4.11) 


Here, on the RHS, the external expectation is relative to the probabilities
$P(N = n_j)$ with which RV $N$ takes its values $n_j$:
$$E[E(X|N)] = \sum_j P(N = n_j)\,E(X|N = n_j).$$
The internal expectation is relative to the conditional probabilities
$P(X = x_i|N = n_j)$ of values $x_i$ of RV $X$, given that $N = n_j$:
$$E(X|N = n_j) = \sum_i x_i P(X = x_i|N = n_j).$$
To prove formula (1.4.11), we simply substitute and use the definition of the
conditional probability:
$$E[E(X|N)] = \sum_j P(N = n_j)\sum_i x_i P(X = x_i|N = n_j)
= \sum_{i,j} x_i P(X = x_i, N = n_j) = \sum_i x_i P(X = x_i) = EX.$$
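Formula (1.4.11) can be illustrated on a small joint distribution. In the sketch below the pair $(X, N)$ is a hypothetical example: $N$ is the first of two fair dice and $X$ is their sum.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution: N is the first of two fair dice, X is their sum.
joint = {}
for d1, d2 in product(range(1, 7), repeat=2):
    key = (d1 + d2, d1)                           # (value of X, value of N)
    joint[key] = joint.get(key, 0) + Fraction(1, 36)

def marginal_n(joint):
    """P(N = n) obtained by summing the joint probabilities over the values of X."""
    out = {}
    for (_, n), p in joint.items():
        out[n] = out.get(n, 0) + p
    return out

def cond_expectation(joint, n):
    """E(X | N = n) = sum_i x_i P(X = x_i | N = n)."""
    p_n = marginal_n(joint)[n]
    return sum(x * p / p_n for (x, k), p in joint.items() if k == n)

e_x = sum(x * p for (x, _), p in joint.items())
tower = sum(p_n * cond_expectation(joint, n)
            for n, p_n in marginal_n(joint).items())
print(e_x, tower)
```

Here $E(X|N = n) = n + 7/2$, and averaging over $n$ recovers $EX = 7$.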
We see that formula (1.4.11) is merely a paraphrase of (1.2.6). We will say
again that it is the result of conditioning by RV $N$.
A handy formula is that for $N$ with non-negative integer values
$$EN = \sum_{n \ge 1} P(N \ge n). \qquad (1.4.12)$$
In fact,
$$\sum_{n \ge 1} P(N \ge n) = \sum_{n \ge 1}\sum_{k \ge n} P(N = k) = \sum_{k \ge 1} P(N = k)\sum_{1 \le n \le k} 1
= \sum_{k \ge 1} P(N = k)\,k = EN.$$


Of course, in formula (1.4.11) the roles of X and N can be swapped. For 
example, in Worked Example 1.2.18, the expected duration of the game is 


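$$\mathbb{E}N = \mathbb{E}\bigl[\mathbb{E}(N \mid X)\bigr] = \sum_{i=2}^{12} p_i\,\mathbb{E}(N \mid X = i),$$

where X is the combined outcome of the first throw, and $p_i = \mathbb{P}(X = i)$.
We have:

$$\mathbb{E}(N \mid X = i) = 1, \quad i = 2, 3, 7, 11, 12; \qquad
\mathbb{E}(N \mid X = i) = 1 + \frac{1}{p_i + p_7}, \quad i = 4, 5, 6, 8, 9, 10.$$

Substituting the values of $p_i$ we get $\mathbb{E}N = 557/165 \approx 3.38$.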
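
The substitution can be checked with exact arithmetic. The sketch below is only an illustration: it assumes that X is the total of two fair dice and uses the conditional means stated above; the function names are ours.

```python
from fractions import Fraction

def two_dice_pmf():
    """Return {total: probability} for the sum of two fair dice."""
    pmf = {}
    for a in range(1, 7):
        for b in range(1, 7):
            pmf[a + b] = pmf.get(a + b, Fraction(0)) + Fraction(1, 36)
    return pmf

def expected_game_length():
    """E N = sum_i p_i E(N | X = i), conditioning on the first throw X."""
    p = two_dice_pmf()
    stop_at_once = {2, 3, 7, 11, 12}
    total = Fraction(0)
    for i, p_i in p.items():
        if i in stop_at_once:
            cond_mean = Fraction(1)
        else:
            # one throw already made, then a geometric wait for i or 7
            cond_mean = 1 + 1 / (p_i + p[7])
        total += p_i * cond_mean
    return total

print(expected_game_length())  # 557/165
```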


Further, we say that RVs X and Y with values $x_i$ and $y_j$ are independent
if for any pair of values

$$\mathbb{P}(X = x_i, Y = y_j) = \mathbb{P}(X = x_i)\,\mathbb{P}(Y = y_j). \qquad (1.4.13)$$

For a triple of variables X, Y, Z, we require that for any triple of values
$\mathbb{P}(X = x_i, Y = y_j, Z = z_k) = \mathbb{P}(X = x_i)\,\mathbb{P}(Y = y_j)\,\mathbb{P}(Z = z_k)$. This
definition is extended to the case of any number of RVs $X_1, \ldots, X_n$: for all
$x_1, \ldots, x_n$:

$$\mathbb{P}(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} \mathbb{P}(X_i = x_i). \qquad (1.4.14)$$

Furthermore, an infinite sequence of RVs $X_1, X_2, \ldots$ is called independent
if for all n the variables $X_1, \ldots, X_n$ are independent.


One may ask here why it suffices to consider in equation (1.4.14) the
‘full’ product $\prod_{i=1}^{n}$ while in the definition of independent events it was
carefully stated that $\mathbb{P}(\bigcap A_{i_k}) = \prod \mathbb{P}(A_{i_k})$ for any subcollection (see
equation (1.2.9)). The answer is: because we require it for any collection of values
$x_1, \ldots, x_n$ of our RVs $X_1, \ldots, X_n$. In fact, when some of the RVs are omitted,
we should take the summation over their values, viz.

$$\mathbb{P}(X_2 = x_2, \ldots, X_n = x_n) = \sum_{x_1} \mathbb{P}(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$$

as the events $\{X_1 = x_1, \ldots, X_{n-1} = x_{n-1}, X_n = x_n\}$ are pair-wise disjoint
for different $x_1$s, and partition the event $\{X_2 = x_2, \ldots, X_n = x_n\}$. So, if

$$\mathbb{P}(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} \mathbb{P}(X_i = x_i)$$

then

$$\mathbb{P}(X_2 = x_2, \ldots, X_n = x_n) = \sum_{x_1} \mathbb{P}(X_1 = x_1)\,\mathbb{P}(X_2 = x_2) \cdots \mathbb{P}(X_n = x_n)
= \mathbb{P}(X_2 = x_2) \cdots \mathbb{P}(X_n = x_n) = \prod_{i=2}^{n} \mathbb{P}(X_i = x_i),$$

i.e. the subcollection $X_2, \ldots, X_n$ is automatically independent.

However, the inverse is not true. For instance, if each of (X, Y), (X, Z)
and (Y, Z) is a pair of independent RVs, it does not necessarily mean that
the triple X, Y, Z is independent. An example is produced from Worked
Example 1.2.18: you simply take $I_{A_1}$, $I_{A_2}$ and $I_{A_3}$ from the solution.

An important concept that now emerges and will be employed throughout
the rest of this volume is a sequence of independent, identically distributed
(IID) RVs $X_1, X_2, \ldots$. Here, in addition to the independence, it is assumed
that the probability $\mathbb{P}(X_i = x)$ is the same for each $i = 1, 2, \ldots$. A good
model here is coin-tossing: $X_n$ may be a function of the result of the nth
tossing, viz.

$$X_n = I(\text{nth toss produces a head}) =
\begin{cases} 1, & \text{if the nth toss produces a head,} \\ 0, & \text{if the nth toss produces a tail.} \end{cases}$$

More generally, you could divide trials into disjoint ‘blocks’ of some given
length l and think of $X_n$ as a function of the outcomes in the nth block.
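An immediate consequence of the definition is that for independent RVs
X, Y, the expected value of the product equals the product of the expected
values:

$$\mathbb{E}(XY) = \sum_{x, y} xy\,\mathbb{P}(X = x)\,\mathbb{P}(Y = y)
= \sum_x x\,\mathbb{P}(X = x) \sum_y y\,\mathbb{P}(Y = y) = \mathbb{E}X\,\mathbb{E}Y. \qquad (1.4.15)$$

However, the inverse assertion is not true: there are RVs with
$\mathbb{E}(XY) = \mathbb{E}X\,\mathbb{E}Y$ which are not independent. A simple example is the following.

Example 1.4.2 Consider

$$X = \begin{cases} -1, & \text{probability } 1/3, \\ \phantom{-}0, & \text{probability } 1/3, \\ \phantom{-}1, & \text{probability } 1/3, \end{cases}
\qquad Y = -\frac{2}{3} + X^2.$$

Here,

$$\mathbb{E}X = 0, \quad \mathbb{E}Y = 0, \quad \mathbb{E}(XY) = -\frac{2}{3}\,\mathbb{E}X + \mathbb{E}X^3 = 0.$$

But, obviously, X and Y are dependent.
On the other hand, it is impossible to construct an example of dependent
RVs X and Y taking two values each such that $\mathbb{E}(XY) = \mathbb{E}X\,\mathbb{E}Y$. In fact,
let X, Y = 0, 1, with

$$\mathbb{P}(X = Y = 0) = w, \quad \mathbb{P}(X = 1, Y = 0) = x, \quad
\mathbb{P}(X = 0, Y = 1) = y, \quad \mathbb{P}(X = Y = 1) = z.$$

Here

$$\mathbb{P}(X = 1) = x + z, \quad \mathbb{P}(Y = 0) = w + x, \quad \mathbb{P}(Y = 1) = y + z,$$

and independence occurs if and only if $z = (x + z)(y + z)$. Next,

$$\mathbb{E}(XY) = z \quad \text{and} \quad \mathbb{E}X\,\mathbb{E}Y = (x + z)(y + z),$$

and $\mathbb{E}(XY) = \mathbb{E}X\,\mathbb{E}Y$ if and only if $z = (x + z)(y + z)$.

The variance of RV X with real values and mean $\mathbb{E}X = \mu$ is defined as

$$\operatorname{Var} X = \mathbb{E}(X - \mu)^2; \qquad (1.4.16)$$

by expanding the square and using linearity of expectation and the fact that
the expected value of a constant equals this constant we have

$$\operatorname{Var} X = \mathbb{E}(X^2 - 2X\mu + \mu^2) = \mathbb{E}X^2 - 2\mu\,\mathbb{E}X + \mu^2
= \mathbb{E}X^2 - 2\mu^2 + \mu^2 = \mathbb{E}X^2 - (\mathbb{E}X)^2. \qquad (1.4.17)$$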


In particular, if X is a constant RV: X = b then $\operatorname{Var} X = b^2 - b^2 = 0$.

The variance is considered as a measure of ‘deviation’ of RV X from its
mean. (The square root $\sqrt{\operatorname{Var} X}$ is called the standard deviation.)

From the first definition we see that $\operatorname{Var} X \ge 0$, i.e. $(\mathbb{E}X)^2 \le \mathbb{E}X^2$. This
is a particular case of the so-called Cauchy–Schwarz (CS) inequality

$$|\mathbb{E}(X\overline{Y})|^2 \le \mathbb{E}|X|^2\,\mathbb{E}|Y|^2, \qquad (1.4.18)$$

valid for any pair of real or complex RVs X and Y, with $\overline{Y}$ standing for
the complex conjugation. (This inequality implies that if $|X|^2$ and $|Y|^2$ have
finite expected values then $X\overline{Y}$ also has a finite expected value.) The CS
inequality is named after two famous mathematicians: the Frenchman
A.-L. Cauchy (1789-1857) who is credited with its discovery in the discrete
case and the German H.A. Schwarz (1843-1921) who proposed it in the
continuous case.

The proof of the CS inequality provides an elegant algebraic exercise (and
digression); for simplicity, we will conduct it for real RVs. Observe that for
all $\lambda \in \mathbb{R}$, the RV $(X + \lambda Y)^2$ is $\ge 0$ and hence has $\mathbb{E}(X + \lambda Y)^2 \ge 0$. As a
function of $\lambda$ it gives an expression

$$\mathbb{E}X^2 + 2\lambda\,\mathbb{E}(XY) + \lambda^2\,\mathbb{E}Y^2, \quad \lambda \in \mathbb{R},$$

which is a quadratic polynomial. To be $\ge 0$ for all $\lambda \in \mathbb{R}$, it must have
a non-positive discriminant, i.e.

$$4(\mathbb{E}XY)^2 \le 4\,\mathbb{E}X^2\,\mathbb{E}Y^2.$$


A concept closely related to the variance is the covariance of two RVs 
X, Y defined as 


Cov(X, Y) = E(X — EX)(Y — EY) =E(XY) — EXEY. (1.4.19) 


By the CS inequality, $|\operatorname{Cov}(X, Y)|^2 \le (\operatorname{Var} X)(\operatorname{Var} Y)$.

For independent variables, Cov(X,Y) = 0. Again, the inverse asser- 
tion does not hold: there are non-independent variables X, Y for which 
Cov(X, Y) = 0. See Example 1.4.2. However, as we checked before, if X and 
Y take two values (or one) and Cov(X, Y) = 0, they will be independent. 

We say that RVs with Cov(X, Y) = 0 are uncorrelated. 
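
A small exact computation may help to separate the two notions. The sketch below assumes a pair (X, Y) in which X takes the values -1, 0, 1 with probability 1/3 each and Y = X^2 - 2/3; the helper names are ours. It reports zero covariance although the pair is not independent.

```python
from fractions import Fraction

third = Fraction(1, 3)
# joint pmf of (X, Y) with Y = X^2 - 2/3 (illustrative choice)
joint = {(x, Fraction(x * x) - Fraction(2, 3)): third for x in (-1, 0, 1)}

def marginals(joint):
    """Marginal pmfs obtained by summing over the complementary RV."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return px, py

def covariance(joint):
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    exy = sum(p * x * y for (x, y), p in joint.items())
    return exy - ex * ey

def is_independent(joint):
    """Check P(X = x, Y = y) = P(X = x) P(Y = y) for every pair of values."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in px for y in py)

print(covariance(joint), is_independent(joint))
```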
For the variance Var(X + Y) of the sum X + Y we have the following
representation:

$$\operatorname{Var}(X + Y) = \operatorname{Var} X + \operatorname{Var} Y + 2\operatorname{Cov}(X, Y). \qquad (1.4.20)$$

In fact, Var(X + Y) equals

$$\mathbb{E}\bigl((X - \mathbb{E}X) + (Y - \mathbb{E}Y)\bigr)^2
= \mathbb{E}(X - \mathbb{E}X)^2 + 2\,\mathbb{E}(X - \mathbb{E}X)(Y - \mathbb{E}Y) + \mathbb{E}(Y - \mathbb{E}Y)^2,$$

which is the RHS of equation (1.4.20).
An important corollary is that the variance of the sum of independent
variables equals the sum of the variances:

$$\operatorname{Var}(X + Y) = \operatorname{Var} X + \operatorname{Var} Y.$$

This fact is easily extended to any number of independent summands:

$$\operatorname{Var}(X_1 + \cdots + X_n) = \operatorname{Var} X_1 + \cdots + \operatorname{Var} X_n. \qquad (1.4.21)$$

In the case of IID RVs, $\operatorname{Var} X_i$ does not depend on i, and if $\operatorname{Var} X_i = \sigma^2$ (a
standard probabilistic notation) then $\operatorname{Var}\bigl(\sum_{i=1}^{n} X_i\bigr) = n\sigma^2$.

On the other hand, if c is a real constant then $\operatorname{Var}(cX) = \mathbb{E}(cX)^2 -
(\mathbb{E}(cX))^2 = c^2\bigl(\mathbb{E}X^2 - (\mathbb{E}X)^2\bigr) = c^2 \operatorname{Var} X$. Hence, $\operatorname{Var}(nX) = n^2 \operatorname{Var} X$. That
is, summing identical RVs produces a quadratic growth in the variance,
whereas summing IID RVs produces only a linear one. At the level of mean
values both options give the same (linear) growth.

A constant RV taking a single value (X = b) is independent of any other
RV (and group of RVs). Therefore, $\operatorname{Var}(X + b) = \operatorname{Var} X + \operatorname{Var} b = \operatorname{Var} X$.

Summarising, for independent RVs $X_1, X_2, \ldots$, and real coefficients
$c_1, c_2, \ldots$,

$$\operatorname{Var} \sum_i c_i X_i = \sum_i c_i^2 \operatorname{Var} X_i,$$

provided that the series in the RHS converges absolutely. More precisely, if
$\sum_i c_i^2 \operatorname{Var} X_i < \infty$, then the series $\sum_i c_i X_i$ defines an RV with finite variance
$\sum_i c_i^2 \operatorname{Var} X_i$.

Finally, formulas (1.4.9), (1.4.10) in the case of independent RVs become

$$\mathbb{P}(X + Y = u) = \sum_y \mathbb{P}(X = u - y)\,\mathbb{P}(Y = y)
= \sum_x \mathbb{P}(X = x)\,\mathbb{P}(Y = u - x) \qquad (1.4.22)$$

and

$$\mathbb{P}(XY = u) = \sum_y \mathbb{P}(X = u/y)\,\mathbb{P}(Y = y)
= \sum_x \mathbb{P}(X = x)\,\mathbb{P}(Y = u/x). \qquad (1.4.23)$$


Equation (1.4.22) is often referred to as the convolution formula. 
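
As an illustration, the convolution of two probability mass functions can be computed directly. The sketch below assumes distributions stored as dictionaries {value: probability}; the function name convolve and the dice example are ours.

```python
from collections import defaultdict
from fractions import Fraction

def convolve(pmf_x, pmf_y):
    """pmf of X + Y for independent X and Y: sum of P(X = x) P(Y = u - x)."""
    out = defaultdict(Fraction)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] += px * py
    return dict(out)

die = {k: Fraction(1, 6) for k in range(1, 7)}
total = convolve(die, die)   # distribution of the sum of two fair dice
print(total[7])              # 1/6
```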


Remark 1.4.3 There is variation in the notation: many authors write
E[X], Var[X] and/or Var(X). As we usually follow the style of the Tripos
questions, such notation will also occasionally appear in our text.


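Concluding this section, we give an alternative proof of the exclusion–inclusion
formula (1.3.1) for probability P(A) of a union $A = \bigcup_{1 \le i \le n} A_i$. By
using the indicators $I_{A_i}$ of events $A_i$, the formula becomes

$$\mathbb{E}I_A = \sum_{1 \le k \le n} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le n} \mathbb{E}\bigl(I_{A_{i_1}} \cdots I_{A_{i_k}}\bigr).$$

Given an outcome $\omega \in \bigcup_{1 \le i \le n} A_i$, let $A_{j_1}, \ldots, A_{j_s}$ be the list of events
containing $\omega$: $\omega \in A_{j_1} \cap \cdots \cap A_{j_s}$ and $\omega \notin A_j$ for all $A_j$ not in the list.
Without loss of generality, assume that $\{j_1, \ldots, j_s\} = \{1, \ldots, s\}$; otherwise
relabel events accordingly. Then for this $\omega$ the RHS must contain terms
$I_{A_{i_1}} \cdots I_{A_{i_k}}$ with $1 \le k \le s$ and $1 \le i_1 < \cdots < i_k \le s$ only. More
precisely we want to check that

$$1 = I_{\bigcup_{1 \le i \le n} A_i}(\omega) = \sum_{1 \le k \le s} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le s} I_{A_{i_1}}(\omega) \cdots I_{A_{i_k}}(\omega).$$

But the RHS is

$$\sum_{1 \le k \le s} (-1)^{k+1} \binom{s}{k} = 1 - \Bigl[1 + \sum_{1 \le k \le s} (-1)^{k} \binom{s}{k}\Bigr] = 1 - (1 - 1)^s = 1.$$

Taking the expectation yields the result.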


Worked Example 1.4.4 I arrive home from a feast and attempt to open
my front door with one of the three keys in my pocket. (You may assume that
exactly one key will open the door and that if I use it I will be successful.)
Find the expected number a of tries that I will need if I take the keys at
random from my pocket but drop any key that fails onto the ground. Find
the expected number b of tries that I will need if I take the keys at random
from my pocket and immediately put back into my pocket any key that fails.
Find a and b and check that b − a = 1.

Solution Label the keys 1, 2 and 3 and assume that keys 2 and 3 are wrong.
Consider the following cases: (i) the first chosen key is right, (ii) the second
key is right, and (iii) the third key is right. Then for a we have

$$a = \frac{1}{3} \times 1 + \Bigl(\frac{2}{3} \times \frac{1}{2}\Bigr) \times 2 + \Bigl(\frac{2}{3} \times \frac{1}{2} \times 1\Bigr) \times 3 = 2.$$

For b, use conditioning by the result of the first attempt:

$$b = \frac{1}{3} \times 1 + \frac{2}{3} \times (1 + b), \quad \text{whence } b = 3.$$

Here factor (1 + b) reflects the fact that when the first attempt fails, we
spend one try, and the problem starts again, by independence.


Remark 1.4.5 The equation for b is similar to earlier equations for
probabilities q and $q_i$; see Worked Examples 1.2.13 and 1.2.14.


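Worked Example 1.4.6 Let N be a non-negative integer-valued RV
with mean $\mu_1$ and variance $\sigma_1^2$, and $X_1, X_2, \ldots$ be identically distributed
RVs with mean $\mu_2$ and variance $\sigma_2^2$; furthermore, assume that $N, X_1, X_2, \ldots$
are independent. Calculate the mean and variance of the RV $S_N = X_1 +
X_2 + \cdots + X_N$.

Solution By conditioning,

$$\mathbb{E}S_N = \mathbb{E}\bigl[\mathbb{E}(S_N \mid N)\bigr] = \sum_{n=0}^{\infty} \mathbb{P}(N = n)\,\mathbb{E}\Bigl(\sum_{i=1}^{n} X_i\Bigr)
= \sum_{n=0}^{\infty} \mathbb{P}(N = n)\,n\mu_2 = \mu_1\mu_2.$$

Next,

$$\mathbb{E}S_N^2 = \sum_{n=0}^{\infty} \mathbb{P}(N = n)\Bigl[\operatorname{Var}\Bigl(\sum_{i=1}^{n} X_i\Bigr) + \Bigl(\mathbb{E}\sum_{i=1}^{n} X_i\Bigr)^2\Bigr]
= \sum_{n=0}^{\infty} \mathbb{P}(N = n)\bigl[n\sigma_2^2 + (n\mu_2)^2\bigr]
= \mu_1\sigma_2^2 + \mu_2^2\,\mathbb{E}N^2 = \mu_1\sigma_2^2 + \mu_2^2(\mu_1^2 + \sigma_1^2).$$

Therefore,

$$\operatorname{Var}(S_N) = \mathbb{E}S_N^2 - (\mathbb{E}S_N)^2 = \mu_1\sigma_2^2 + \mu_2^2\sigma_1^2.$$

Solution (Look at this solution after you have learnt about probability
generating functions in Section 1.5.) Write

$$\phi(s) = \mathbb{E}s^{S_N} = g(h(s)),$$

where $h(s) = \mathbb{E}s^{X_1}$, $g(s) = \mathbb{E}s^{N}$. Differentiating yields $\phi'(s) = g'(h(s))\,h'(s)$,
and so

$$\mathbb{E}(S_N) = \phi'(1) = g'(h(1))\,h'(1) = g'(1)\,h'(1) = \mu_1\mu_2.$$

Further, $\phi''(s) = g''(h(s))\,h'(s)^2 + g'(h(s))\,h''(s)$, and so

$$\mathbb{E}\bigl(S_N(S_N - 1)\bigr) = \phi''(1) = g''(1)\,h'(1)^2 + g'(1)\,h''(1)$$

and

$$\operatorname{Var}(S_N) = \phi''(1) + \phi'(1) - (\phi'(1))^2 = \mu_1\sigma_2^2 + \mu_2^2\sigma_1^2.$$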
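
The two moment formulas for a random sum can be confirmed exactly on a small case. The sketch below builds the distribution of the sum of N IID terms by repeated convolution; the particular distributions of N and of the summands are arbitrary illustrative choices, and all names are ours.

```python
from fractions import Fraction

def mean_var(pmf):
    """Mean and variance of a pmf given as {value: probability}."""
    m = sum(p * v for v, p in pmf.items())
    return m, sum(p * v * v for v, p in pmf.items()) - m * m

def convolve(a, b):
    out = {}
    for x, px in a.items():
        for y, py in b.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * py
    return out

def random_sum_pmf(pmf_n, pmf_x):
    """pmf of X_1 + ... + X_N with N independent of the IID X_i."""
    out = {}
    n_fold = {0: Fraction(1)}            # pmf of the empty sum (n = 0)
    for n in range(max(pmf_n) + 1):
        weight = pmf_n.get(n, Fraction(0))
        for s, p in n_fold.items():
            out[s] = out.get(s, Fraction(0)) + weight * p
        n_fold = convolve(n_fold, pmf_x)  # move on to the (n+1)-fold sum
    return out

pmf_n = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}
pmf_x = {0: Fraction(1, 3), 2: Fraction(2, 3)}
mu1, var1 = mean_var(pmf_n)
mu2, var2 = mean_var(pmf_x)
m, v = mean_var(random_sum_pmf(pmf_n, pmf_x))
print(m == mu1 * mu2, v == mu1 * var2 + mu2 ** 2 * var1)
```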


Worked Example 1.4.7 Define the variance Var X of a random variable
X and the covariance Cov(X, Y) of the pair X, Y.
Let $X_1, X_2, \ldots, X_n$ be random variables. Show that

$$\operatorname{Var}(X_1 + X_2 + \cdots + X_n) = \sum_{i,j=1}^{n} \operatorname{Cov}(X_i, X_j).$$

Ten people sit in a circle, and each tosses a fair coin. Let N be the number
of people whose coin shows the same side as both of the coins tossed by the
two neighbouring people. Find P(N = 9) and P(N = 10). By representing
N as a sum of random variables, or otherwise, find the mean and variance
of N.

Solution Let $Y_i = X_i - \mathbb{E}X_i$. Then

$$\operatorname{Var}\Bigl(\sum_{i=1}^{n} X_i\Bigr) = \mathbb{E}\Bigl(\sum_{i=1}^{n} Y_i\Bigr)^2 = \mathbb{E}\Bigl(\sum_{i,j=1}^{n} Y_iY_j\Bigr) = \sum_{i,j=1}^{n} \mathbb{E}(Y_iY_j).$$

But $\mathbb{E}(Y_iY_j) = \operatorname{Cov}(X_i, X_j)$, hence the result. Next observe that P(N = 9) = 0.
Indeed, you can’t have nine people out of ten in agreement with
their neighbours and just one person in disagreement: both of his neighbours
agree with him, and then, going round the circle, all ten coins show the same
side. Further, $\mathbb{P}(N = 10) = 2 \times (1/2)^{10}$: there are two ways
to fix the side, and $(1/2)^{10}$ is the probability that all ten coins show this
particular side.

Finally, write $N = I_{A_1} + \cdots + I_{A_{10}}$, where $I_{A_i}$ is the indicator of the event

$$A_i = \{\text{the ith person and both neighbours have the same side showing}\}.$$

By symmetry,

$$\mathbb{E}N = 10\,\mathbb{P}(A_i) = 10 \times \frac{1}{4} = 2.5.$$

Further, RVs $I_{A_i}$, $I_{A_j}$ are pair-wise independent if i is not next to j. In fact,
when i and j do not have a common neighbour, this is obvious, as each RV
depends on a disjoint collection of independent trials. Next, assume that i and
j have a (single) common neighbour. Then

$$\mathbb{P}(I_{A_i} = 1, I_{A_j} = 1) = \frac{1}{32} + \frac{1}{32} = \frac{1}{16}.$$

At the same time,

$$\mathbb{P}(I_{A_i} = 1) = \mathbb{P}(I_{A_j} = 1) = \frac{1}{4}.$$

Thus,

$$\operatorname{Var}(N) = 10 \operatorname{Var}(I_{A_1}) + 20 \operatorname{Cov}(I_{A_1}, I_{A_2})
= 10\Bigl(\frac{1}{4} - \frac{1}{16}\Bigr) + 20\Bigl[\mathbb{E}(I_{A_1}I_{A_2}) - \frac{1}{16}\Bigr].$$

Further,

$$\mathbb{E}(I_{A_1}I_{A_2}) = \mathbb{P}(I_{A_1} = 1 \mid I_{A_2} = 1)\,\mathbb{P}(I_{A_2} = 1) = \frac{1}{2} \times \frac{1}{4} = \frac{1}{8}.$$

Thus,

$$\operatorname{Var} N = 10 \times \frac{3}{16} + 20\Bigl(\frac{1}{8} - \frac{1}{16}\Bigr) = 3.125.$$
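
Since there are only 2^10 equally likely outcomes, the answers can also be confirmed by brute force. The sketch below enumerates all coin configurations round the circle; the function names are ours.

```python
from fractions import Fraction
from itertools import product

def agreement_count(coins):
    """Number of people whose coin matches both neighbours' coins (circular)."""
    n = len(coins)
    return sum(coins[i - 1] == coins[i] == coins[(i + 1) % n] for i in range(n))

def distribution(n=10):
    """Exact pmf of N over all 2^n equally likely outcomes."""
    pmf = {}
    weight = Fraction(1, 2 ** n)
    for coins in product((0, 1), repeat=n):
        k = agreement_count(coins)
        pmf[k] = pmf.get(k, Fraction(0)) + weight
    return pmf

pmf = distribution()
mean = sum(k * p for k, p in pmf.items())
var = sum(k * k * p for k, p in pmf.items()) - mean ** 2
print(mean, var)   # 5/2 25/8
```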


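Worked Example 1.4.8 Let $X_1, \ldots, X_n$ be IID RVs with mean $\mu$ and
variance $\sigma^2$. Find the mean of

$$S = \sum_{i=1}^{n} (X_i - \overline{X})^2, \quad \text{where } \overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

Solution First, consider the mean value and the variance of $\overline{X}$:

$$\mathbb{E}\overline{X} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}X_i = \mu,$$

and

$$\operatorname{Var}\overline{X} = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var} X_i = \frac{\sigma^2}{n}.$$

Then

$$\mathbb{E}S = \mathbb{E}\Bigl(\sum_{i=1}^{n} X_i^2 - 2\overline{X}\sum_{i=1}^{n} X_i + n\overline{X}^2\Bigr)
= \mathbb{E}\Bigl(\sum_{i=1}^{n} X_i^2 - n\overline{X}^2\Bigr)
= \sum_{i=1}^{n} \mathbb{E}X_i^2 - n\,\mathbb{E}\overline{X}^2$$
$$= n(\mu^2 + \sigma^2) - n\bigl[(\mathbb{E}\overline{X})^2 + \operatorname{Var}\overline{X}\bigr]
= n(\mu^2 + \sigma^2) - n\Bigl(\mu^2 + \frac{\sigma^2}{n}\Bigr) = \sigma^2(n - 1).$$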


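Worked Example 1.4.9 Let $(X_i)$ be a sequence of IID positive RVs,
where $\mathbb{E}(X_1) = a$ and $\mathbb{E}X_1^{-1} = b$ exist. Let $S_n = \sum_{i=1}^{n} X_i$. Show that
$\mathbb{E}(S_m/S_n) = m/n$ if $m \le n$, and $\mathbb{E}(S_m/S_n) = 1 + (m - n)a\,\mathbb{E}(S_n^{-1})$ if $m \ge n$.

Solution For $m \le n$ write

$$\mathbb{E}\Bigl(\frac{S_m}{S_n}\Bigr) = \mathbb{E}\Bigl(\frac{X_1 + \cdots + X_m}{S_n}\Bigr) = \sum_{i=1}^{m} \mathbb{E}\Bigl(\frac{X_i}{S_n}\Bigr) = m\,\mathbb{E}\Bigl(\frac{X_1}{S_n}\Bigr).$$

For m = n the LHS equals 1, so $\mathbb{E}(X_1/S_n) = 1/n$ and hence $\mathbb{E}(S_m/S_n) = m/n$.
For $m > n$, the RVs $X_{n+1}, \ldots, X_m$ are independent of $S_n$, and

$$\mathbb{E}\Bigl(\frac{S_m}{S_n}\Bigr) = 1 + \sum_{i=n+1}^{m} \mathbb{E}\Bigl(\frac{X_i}{S_n}\Bigr) = 1 + (m - n)a\,\mathbb{E}\Bigl(\frac{1}{S_n}\Bigr).$$

The mean value $\mathbb{E}(1/S_n)$ is finite (and is $\le b$) since $1/S_n \le 1/X_1$.

It is important to stress that $S_m$ and $S_n$ are not independent.

Worked Example 1.4.10 Suppose that X and Y are independent real
random variables with $|X|, |Y| \le K$ for some constant K. If Z = XY show
that the variance of Z is given by

$$\operatorname{Var} Z = \operatorname{Var} X \operatorname{Var} Y + \operatorname{Var} Y\,(\mathbb{E}X)^2 + \operatorname{Var} X\,(\mathbb{E}Y)^2,$$

stating the properties of expectation that you use.

Solution Write $\operatorname{Var}(XY) = \mathbb{E}(XY)^2 - (\mathbb{E}(XY))^2 = \mathbb{E}X^2\,\mathbb{E}Y^2 - (\mathbb{E}X)^2(\mathbb{E}Y)^2$
and continue

$$= \bigl[\operatorname{Var} X + (\mathbb{E}X)^2\bigr]\bigl[\operatorname{Var} Y + (\mathbb{E}Y)^2\bigr] - (\mathbb{E}X)^2(\mathbb{E}Y)^2
= \operatorname{Var} X \operatorname{Var} Y + (\mathbb{E}X)^2 \operatorname{Var} Y + (\mathbb{E}Y)^2 \operatorname{Var} X.$$

The facts used are: linearity of expectation, the equations $\mathbb{E}Z = \mathbb{E}X\,\mathbb{E}Y$ and
$\mathbb{E}Z^2 = \mathbb{E}X^2\,\mathbb{E}Y^2$ that hold due to independence, and finiteness of all mean
values because of the boundedness of random variables X and Y.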


Worked Example 1.4.11 Let $a_1, a_2, \ldots, a_n$ be yearly rainfalls in Cambridge
over the past n years: assume $a_1, a_2, \ldots, a_n$ are IID RVs. Say that k
is a record year if $a_k > a_i$ for all i < k (thus the first year is always a record
year). Let $Y_i = 1$ if i is a record year and 0 otherwise. Find the distribution
of $Y_i$ and show that $Y_1, Y_2, \ldots, Y_n$ are independent. Calculate the mean
and variance of the number of record years in the next n years.

Set N = j if j is the first record year after year 1, $1 < j \le n$, and
N = n if $a_1$ or $a_n$ are maximal among $a_1, \ldots, a_n$ (i.e. the first or the last
year produced the absolute record). Show that $\mathbb{E}N \to \infty$ as $n \to \infty$.

Solution Ranking the $a_i$s in a non-increasing order generates a random
permutation of 1, 2, ..., n; by symmetry, all such permutations will be equiprobable.
Ranking the first i rainfalls yields a random permutation of 1, 2, ..., i.
Hence,

$$\mathbb{P}(Y_i = 1) = \frac{1}{i}, \quad \mathbb{P}(Y_i = 0) = \frac{i - 1}{i}, \quad \mathbb{E}Y_i = \frac{1}{i}, \quad \operatorname{Var} Y_i = \frac{1}{i} - \frac{1}{i^2}.$$

Observe that RVs $Y_1, \ldots, Y_n$ are independent. In fact, set

$$X_i = \text{the ranking of year } i \text{ among } 1, \ldots, i.$$

Then $Y_i$ is a function of $X_i$ only. So, it is enough to check that the $X_i$ are
independent. In fact, for any collection of values $x_i$, $i = 1, \ldots, n$, with
$x_i \le i$, the event $(X_i = x_i,\ 1 \le i \le n)$ is realised for a unique
permutation out of n!. Thus, as $\mathbb{P}(X_i = x_i) = 1/i$,

$$\frac{1}{n!} = \mathbb{P}(X_i = x_i,\ 1 \le i \le n) = \prod_{1 \le i \le n} \mathbb{P}(X_i = x_i).$$

Next,

$$\mathbb{E}\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} \frac{1}{i}$$

and

$$\operatorname{Var}\Bigl(\sum_{i=1}^{n} Y_i\Bigr) = \sum_{i=1}^{n} \operatorname{Var} Y_i = \sum_{i=1}^{n} \Bigl(\frac{1}{i} - \frac{1}{i^2}\Bigr).$$

Finally, for m = 2, ..., n, the probability $\mathbb{P}(N \ge m)$ equals

$$\mathbb{P}(2, \ldots, m - 1 \text{ are not records}) = 1 \times \frac{1}{2} \times \frac{2}{3} \times \cdots \times \frac{m - 2}{m - 1} = \frac{1}{m - 1}.$$

Hence, $\mathbb{E}N \ge \sum_{m=2}^{n} 1/(m - 1)$ which tends to $\infty$ as $n \to \infty$. Observe,
however, that with probability 1 there will be infinitely many record years
if observations are continued indefinitely.
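
For a small number of years the independence of the record indicators can be checked by enumeration. The sketch below assumes that ties have probability zero, so the ordering of the rainfalls is a uniformly random permutation; n = 5 and all names are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product
from math import factorial

def record_indicators(perm):
    """Y_i = 1 if position i holds a new maximum (a record), else 0."""
    best, ys = None, []
    for a in perm:
        is_record = best is None or a > best
        ys.append(int(is_record))
        if is_record:
            best = a
    return tuple(ys)

def joint_pmf(n):
    """Exact joint pmf of (Y_1, ..., Y_n) under a uniform random permutation."""
    pmf = {}
    weight = Fraction(1, factorial(n))
    for perm in permutations(range(n)):
        ys = record_indicators(perm)
        pmf[ys] = pmf.get(ys, Fraction(0)) + weight
    return pmf

def product_of_marginals(ys):
    """Product of P(Y_i = y_i) with P(Y_i = 1) = 1/i."""
    prob = Fraction(1)
    for i, y in enumerate(ys, start=1):
        prob *= Fraction(1, i) if y else Fraction(i - 1, i)
    return prob

n = 5
pmf = joint_pmf(n)
independent = all(pmf.get(ys, Fraction(0)) == product_of_marginals(ys)
                  for ys in product((0, 1), repeat=n))
mean_records = sum(p * sum(ys) for ys, p in pmf.items())
print(independent, mean_records)
```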


Worked Example 1.4.12 An expedition is sent to the Himalayas with 
the objective of catching a pair of wild yaks for breeding. Assume yaks are 
loners and roam about the Himalayas at random and that the numbers of 
males and females are equal. The probability p € (0,1) that a given trapped 
yak is male is independent of prior outcomes. Let N be the number of yaks 
that must be caught until a pair is obtained. 

(i) Show that the expected value of N is 1+ p/q+q/p, where q=1-p. 

(ii) Find the variance of N. 


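Solution (i) Clearly, $\mathbb{P}(N = n) = p^{n-1}q + q^{n-1}p$ for $n \ge 2$. Then $\mathbb{E}N$ equals

$$q\sum_{n \ge 2} n p^{n-1} + p\sum_{n \ge 2} n q^{n-1}
= q\Bigl(\frac{1}{(1 - p)^2} - 1\Bigr) + p\Bigl(\frac{1}{(1 - q)^2} - 1\Bigr)
= \frac{1}{q} - q + \frac{1}{p} - p,$$

which gives 1 + (p/q) + (q/p).
(ii) Further,

$$\operatorname{Var} N = \mathbb{E}N(N - 1) + \mathbb{E}N - (\mathbb{E}N)^2$$

and

$$\mathbb{E}N(N - 1) = pq\sum_{n \ge 2} n(n - 1)p^{n-2} + pq\sum_{n \ge 2} n(n - 1)q^{n-2}
= pq\Bigl(\frac{2}{(1 - p)^3} + \frac{2}{(1 - q)^3}\Bigr) = \frac{2p}{q^2} + \frac{2q}{p^2}.$$

So, the variance of N equals

$$\frac{2p}{q^2} + \frac{2q}{p^2} + \Bigl(1 + \frac{p}{q} + \frac{q}{p}\Bigr) - \Bigl(1 + \frac{p}{q} + \frac{q}{p}\Bigr)^2
= \frac{1}{q^2} + \frac{1}{p^2} - 2 - \frac{1}{q} - \frac{1}{p}.$$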
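
A numerical check of both parts is straightforward. The sketch below sums the series for the mean and the variance directly from P(N = n) = p^(n-1) q + q^(n-1) p and compares them with the closed forms 1 + p/q + q/p and 1/p^2 + 1/q^2 - 1/p - 1/q - 2 (the latter being the simplified form of EN(N-1) + EN - (EN)^2); the truncation point and the value p = 0.3 are arbitrary choices.

```python
def yak_moments(p, terms=2000):
    """Mean and variance of N from a truncated series over n = 2, 3, ..."""
    q = 1.0 - p
    mean = second = 0.0
    for n in range(2, terms):
        prob = p ** (n - 1) * q + q ** (n - 1) * p
        mean += n * prob
        second += n * n * prob
    return mean, second - mean ** 2

p = 0.3
q = 1 - p
mean, var = yak_moments(p)
closed_mean = 1 + p / q + q / p
closed_var = 1 / p ** 2 + 1 / q ** 2 - 1 / p - 1 / q - 2
print(abs(mean - closed_mean) < 1e-9, abs(var - closed_var) < 1e-9)
```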


Worked Example 1.4.13  Liam’s bowl of spaghetti contains n strands. 
He selects two ends at random and joins them together. He does this until 
there are no ends left. What is the expected number of spaghetti hoops in 
the bowl? 


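Solution Setting

$$X_i = \begin{cases} 1, & \text{if the ith join makes a loop,} \\ 0, & \text{otherwise,} \end{cases}$$

find

$$\mathbb{P}(X_i = 1) = \frac{1}{2n - 2i + 1}.$$

(By the ith join you have 2n − 2(i − 1) ends untied; for an end chosen there
are 2n − 2i + 1 possibilities to choose the second end, and only one of them
leads to a hoop.) Thus

$$\mathbb{E}\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} \mathbb{P}(X_i = 1) = \sum_{i=1}^{n} \frac{1}{2n - 2i + 1} = \sum_{i=0}^{n-1} \frac{1}{2i + 1}.$$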


Worked Example 1.4.14 Sarah collects figures from cornflakes packets.
Each packet contains one figure, and n distinct figures make a complete set.
Show that the expected number of packets Sarah needs to collect a complete
set is

$$n\sum_{i=1}^{n} \frac{1}{i}.$$

Solution The number of packets needed is

$$N = 1 + Y_1 + \cdots + Y_{n-1}.$$

Here $Y_1$ represents the number of packets needed for collecting the second
figure, $Y_2$ the third figure, and so on. Each $Y_j$ has a geometric distribution:

$$\mathbb{P}(Y_j = s) = \Bigl(\frac{j}{n}\Bigr)^{s-1} \frac{n - j}{n}, \quad s = 1, 2, \ldots.$$

Hence, $\mathbb{E}Y_j = n/(n - j)$. Then

$$\mathbb{E}N = n\sum_{j=0}^{n-1} \frac{1}{n - j} = n\sum_{i=1}^{n} \frac{1}{i} \approx n \ln n.$$
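
The two forms of the answer can be compared exactly for small n. The sketch below assumes, as the geometric distribution above does, that every figure is equally likely in each packet; the function names are ours.

```python
from fractions import Fraction

def expected_packets(n):
    """E N = 1 + sum of E Y_j, with E Y_j = n / (n - j) for j = 1, ..., n - 1."""
    return 1 + sum(Fraction(n, n - j) for j in range(1, n))

def harmonic_form(n):
    """The closed form n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(Fraction(1, i) for i in range(1, n + 1))

for n in (1, 2, 3, 10):
    print(n, expected_packets(n), harmonic_form(n))
```

For n = 3 both expressions give 11/2.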


Worked Example 1.4.15 The RV N takes values in the non-negative
integers. Show that N has mean satisfying

$$\mathbb{E}(N) = \sum_{k=0}^{\infty} \mathbb{P}(N > k),$$

whenever this series converges.

Each packet of the breakfast cereal Soggies contains exactly one token,
and tokens are available in each of the three colours blue, white and red.
You may assume that each token obtained is equally likely to be of the
three available colours, and that the (random) colours of different tokens
are independent. Find the probability that, having searched the contents of
k packets of Soggies, you have not yet obtained tokens of every colour.

Let N be the number of packets required until you have obtained tokens
of every colour. Show that $\mathbb{E}(N) = 11/2$.

Solution Write

$$\sum_{n=1}^{\infty} \mathbb{P}(N = n)\,n = \sum_{n=1}^{\infty} \sum_{r=n}^{\infty} \mathbb{P}(N = r) = \sum_{n=1}^{\infty} \mathbb{P}(N \ge n) = \sum_{k=0}^{\infty} \mathbb{P}(N > k).$$

Further, for $k \ge 1$ the probability $\mathbb{P}(N > k)$ equals

P(not yet obtained after k trials) = P(all red or blue after k trials)
+ P(all red or white after k trials) + P(all blue or white after k trials)
− P(all red) − P(all blue) − P(all white) $= 3\bigl[(2/3)^k - (1/3)^k\bigr]$.

Then, as $\mathbb{P}(N > 0) = 1$, the expected value of N equals

$$1 + 3\sum_{k=1}^{\infty} \Bigl[\Bigl(\frac{2}{3}\Bigr)^k - \Bigl(\frac{1}{3}\Bigr)^k\Bigr] = 1 + 3\Bigl(2 - \frac{1}{2}\Bigr) = \frac{11}{2}.$$




Example 1.4.16 A useful exercise is to prove formulas for the mean values
of R = min[X, Y] and S = max[X, Y], where X and Y are independent
RVs with non-negative integer values and finite means. Here

$$\mathbb{E}R = \sum_{n \ge 1} \mathbb{P}(\min[X, Y] \ge n) = \sum_{n \ge 1} \mathbb{P}(X \ge n)\,\mathbb{P}(Y \ge n).$$

Next, as R + S = X + Y, the mean value $\mathbb{E}S = \mathbb{E}X + \mathbb{E}Y - \mathbb{E}R$, which is
equal to

$$\sum_{n \ge 1} \bigl[\mathbb{P}(X \ge n) + \mathbb{P}(Y \ge n) - \mathbb{P}(X \ge n)\,\mathbb{P}(Y \ge n)\bigr].$$
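
These tail-sum formulas are easy to test against a direct computation. The sketch below uses two fair dice as the independent RVs (an illustrative choice) and compares the formulas with the expectations computed from the joint distribution; the function names are ours.

```python
from fractions import Fraction

def tail(pmf, n):
    """P(value >= n) for a pmf given as {value: probability}."""
    return sum(p for v, p in pmf.items() if v >= n)

def mean_min_max(pmf_x, pmf_y):
    """E min and E max via sums of tail probabilities (independent X, Y)."""
    top = max(max(pmf_x), max(pmf_y))
    e_min = sum(tail(pmf_x, n) * tail(pmf_y, n) for n in range(1, top + 1))
    e_max = sum(tail(pmf_x, n) + tail(pmf_y, n) - tail(pmf_x, n) * tail(pmf_y, n)
                for n in range(1, top + 1))
    return e_min, e_max

def brute_force(pmf_x, pmf_y):
    """The same means computed directly from the product distribution."""
    pairs = [(x, px, y, py) for x, px in pmf_x.items() for y, py in pmf_y.items()]
    e_min = sum(px * py * min(x, y) for x, px, y, py in pairs)
    e_max = sum(px * py * max(x, y) for x, px, y, py in pairs)
    return e_min, e_max

die = {k: Fraction(1, 6) for k in range(1, 7)}
print(mean_min_max(die, die), brute_force(die, die))
```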


Remark 1.4.17 Independence plays an important role in the analysis of 
gambling and betting (which was a strong motivation for developing proba- 
bilistic concepts in the sixteenth to nineteenth centuries). For example, the 
following strategy is related to a concept of a martingale. (This term will 
appear many times in the future volumes.) A gambler bets on a sequence 
of independent events each of probability 1/2. First, he bets $1 on the first 
event. If he wins he quits, if he loses, he bets $2 on the next event. Again, if 
he wins he quits, otherwise he bets $4 and then quits anyway. In principle, 
one could stick to the same strategy for any given number of rounds, or even 
wait for the success however long it took. The point here is that success will 
eventually occur with probability 1 (as an infinite series of subsequent
failures has probability 0). In this case the gain of \$1 is guaranteed: if success
occurs at trial k, the profit is $2^{k-1}$ and the total loss $1 + 2 + \cdots + 2^{k-2} = 2^{k-1} - 1$.

In general, the gambler could also bet different amounts $S_1, S_2, \ldots$ on
different rounds. It is easy to see that the expected gain $\mathbb{E}X_n$ will be zero
after any given number of rounds n. For instance, after three betting rounds
the expected gain is

$$\frac{S_1}{2} + \frac{S_2 - S_1}{4} + \frac{S_3 - S_2 - S_1}{8} - \frac{S_1 + S_2 + S_3}{8} = 0.$$

In general, this fact is checked by induction in n. It seems that it contradicts
the previous argument that amount \$1 could be obtained with
probability 1. However, that will occur at a random trial! And although the
expected number of trials until the first success is 2, the expected capital
the gambler will need with doubled bets $S_k = 2^{k-1}$ is

$$\sum_{k=1}^{\infty} \Bigl(\frac{1}{2}\Bigr)^k 2^{k-1} = \sum_{k=1}^{\infty} \frac{1}{2} = \infty.$$

(This is known as the St Petersburg Gambling Paradox: it caused
consternation in the Russian high society in the early nineteenth century.)

It would be natural for the gambler to try to minimise the variance $\operatorname{Var} X_n$
of his loss after n rounds. Again, a straightforward calculation shows that
for n = 3:

$$\operatorname{Var} X_3 = \frac{S_1^2}{2} + \frac{(S_2 - S_1)^2}{4} + \frac{(S_3 - S_2 - S_1)^2}{8} + \frac{(S_1 + S_2 + S_3)^2}{8}
= S_1^2 + \frac{S_2^2}{2} + \frac{S_3^2}{4}.$$

This is minimised when the bets are placed in increasing order, i.e. $S_1 \le
S_2 \le S_3$. The same is true for any n.


Worked Example 1.4.18 Hamlet, Rosencrantz and Guildenstern are flip- 
ping coins. The odd man wins the coins of the others; if all coins appear alike, 
no coins change hands. Find the expected number of throws required to force 
one man out of the game provided Hamlet has 14 coins to start with and 
Rosencrantz and Guildenstern have 6 coins each. 

Hint: Look for the expectation in the form Klmn, where l, m and n are
the numbers of coins they start with and K depends only on l + m + n, the
total number of coins.


Solution The conditions of the game are that the total number of coins is
constant. Hence, if H denotes the number of Hamlet’s coins, R of
Rosencrantz’s and G of Guildenstern’s then

$$H + R + G = 26. \qquad (1.4.24)$$

The game corresponds to a random walk on integer points of a
three-dimensional lattice satisfying equation (1.4.24), with $H \ge 0$, $R \ge 0$ and
$G \ge 0$. The game ends when at least one of the inequalities becomes
equality. The jump probabilities are

$$(H, R, G) \to (H + 2, R - 1, G - 1), \quad \text{probability } 1/4,$$
$$(H, R, G) \to (H - 1, R + 2, G - 1), \quad \text{probability } 1/4,$$
$$(H, R, G) \to (H - 1, R - 1, G + 2), \quad \text{probability } 1/4,$$
$$(H, R, G) \to (H, R, G), \quad \text{probability } 1/4.$$

The walk starts at H = 14, R = G = 6.
Let $E_{H,R,G}$ be the expected number of throws to reach the end if the
starting amounts are H, R, G. Then, by the formula of conditional expectation,

$$E_{H,R,G} = \frac{1}{4}(1 + E_{H,R,G}) + \frac{1}{4}(1 + E_{H+2,R-1,G-1})
+ \frac{1}{4}(1 + E_{H-1,R-1,G+2}) + \frac{1}{4}(1 + E_{H-1,R+2,G-1}),$$

where we condition on the first throw. That is

$$\frac{3}{4}E_{H,R,G} - 1 = \frac{1}{4}\bigl(E_{H+2,R-1,G-1} + E_{H-1,R+2,G-1} + E_{H-1,R-1,G+2}\bigr),$$

with the boundary conditions $E_{0,R,G} = E_{H,0,G} = E_{H,R,0} = 0$.
The conjectured form is $E_{H,R,G} = K(H + G + R)\,HGR$ whence

$$K = \frac{4}{3(H + G + R - 2)}$$

and

$$E_{14,6,6} = \frac{4 \times 14 \times 6 \times 6}{3 \times (26 - 2)} = \frac{336}{12} = 28.$$
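
The conjectured form can be checked against the recursion by exact arithmetic. The sketch below verifies, for every interior state with 26 coins in total, that 4HRG/(3(H + R + G − 2)) satisfies the displayed equation with zero boundary values; it does not address uniqueness of the solution, and the function names are ours.

```python
from fractions import Fraction

def conjectured(h, r, g):
    """Candidate expected duration: 4 h r g / (3 (h + r + g - 2))."""
    return Fraction(4 * h * r * g, 3 * (h + r + g - 2))

def satisfies_recursion(total=26):
    """Check (3/4) E - 1 = (1/4)(sum over the three coin-moving throws)."""
    for h in range(1, total - 1):
        for r in range(1, total - h):
            g = total - h - r
            lhs = Fraction(3, 4) * conjectured(h, r, g) - 1
            rhs = Fraction(1, 4) * (conjectured(h + 2, r - 1, g - 1)
                                    + conjectured(h - 1, r + 2, g - 1)
                                    + conjectured(h - 1, r - 1, g + 2))
            if lhs != rhs:
                return False
    return True

print(satisfies_recursion(), conjectured(14, 6, 6))   # True 28
```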


Remark 1.4.19 A suggestive guess is as follows: $E_{H,R,G}$ is a symmetric
function of H, R, G. Hence, it must be a function of symmetric polynomials
in H, R, G, viz.

$$H + R + G, \quad HG + HR + GR, \quad HRG,$$

etc. The boundary condition yields that HRG should appear as a factor.


Worked Example 1.4.20 We play the following coin-tossing game. Each 
of us tosses one (unbiased) coin; if they match, I get both coins; if they differ 
you get both. Initially, I have m coins and you n. Let E(m, n) be the expected 


length of play until one of us has no coins left. Write down a linear relation 
between E(m,n), E(m—1,n+1) and E(m+1,n—1). Expressing this as a 
function of m+n and m—n, deduce that E(m,n) is a quadratic function 


of m and n, and hence, using appropriate initial conditions, deduce that 


E(m,n) = mn. 


Solution Proceed as above, getting

$$E(m, n) = \frac{1}{2}E(m - 1, n + 1) + \frac{1}{2}E(m + 1, n - 1) + 1.$$

The answer is E(m, n) = mn.

Worked Example 1.4.21 In the Aunt Agatha problem (see Worked
Example 1.2.10), what is the expected time until the clockwork orange falls off
the table?

Solution Continuing the solution of Worked Example 1.2.10, let $E_\ell$ be the
(conditional) expected time while starting at distance $10 \times \ell$ cm from the
left end. Then $E_0 = E_{20} = 0$ and

$$E_\ell = \frac{3}{5}E_{\ell-1} + \frac{2}{5}E_{\ell+1} + 1,$$

or $E_{\ell+1} = (5/2)E_\ell - (3/2)E_{\ell-1} - 5/2$. For vectors $u_\ell = (E_\ell, E_{\ell+1})$ and
$v = (0, -5/2)$ we have $u_\ell = u_{\ell-1}A + v$. Here A is as before:

$$A = \begin{pmatrix} 0 & -3/2 \\ 1 & 5/2 \end{pmatrix}$$

with the eigenvalues $\lambda_1 = 3/2$ and $\lambda_2 = 1$ and eigenvectors $e_1 = (1, 3/2)$
and $e_2 = (1, 1)$. Note that $v = 5(e_2 - e_1)$.

Iterating yields

$$u_\ell = u_0A^\ell + v\sum_{0 \le j \le \ell - 1} A^j$$

and if vector $u_0 = a_1e_1 + a_2e_2$, then

$$u_\ell = \Bigl[a_1\lambda_1^\ell - 10(\lambda_1^\ell - 1)\Bigr]e_1 + \bigl[a_2 + 5\ell\bigr]e_2.$$

Now, substituting into the first component $E_0 = 0$ (for $\ell = 0$) yields

$$a_1 + a_2 = 0,$$

and into the second component $E_{20} = 0$ (for $\ell = 19$):

$$a_1\Bigl(\frac{3}{2}\Bigr)^{20} + a_2 = 15\Bigl[\Bigl(\frac{3}{2}\Bigr)^{19} - 1\Bigr] - 95 = 10\Bigl[\Bigl(\frac{3}{2}\Bigr)^{20} - 1\Bigr] - 100.$$

Then

$$a_1 = -a_2 = \frac{10\bigl((3/2)^{20} - 1\bigr) - 100}{(3/2)^{20} - 1} = 10 - \frac{100}{(3/2)^{20} - 1}$$

and

$$E_{10} = \Bigl[10 - \frac{100}{(3/2)^{20} - 1}\Bigr]\Bigl[\Bigl(\frac{3}{2}\Bigr)^{10} - 1\Bigr] - 10\Bigl[\Bigl(\frac{3}{2}\Bigr)^{10} - 1\Bigr] + 50
= 50 - \frac{100}{(3/2)^{10} + 1}.$$

A rough estimate yields $E_{10} \ge 48$.
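
The boundary-value problem can also be solved directly, which gives an independent check of the closed form. The sketch below assumes the recursion with probabilities 3/5 (towards 0) and 2/5 (towards 20) and zero boundary values at 0 and 20; it treats the unknown value at position 1 as a free parameter fixed by the right-hand boundary condition. The function name is ours.

```python
from fractions import Fraction

def expected_times(size=20, down=Fraction(3, 5), up=Fraction(2, 5)):
    """Solve E_l = down*E_{l-1} + up*E_{l+1} + 1 with E_0 = E_size = 0."""
    # Write E_l = a[l] + b[l] * t, where t = E_1 is unknown.
    a = [Fraction(0), Fraction(0)]
    b = [Fraction(0), Fraction(1)]
    for l in range(1, size):
        # rearranged recursion: E_{l+1} = (E_l - down*E_{l-1} - 1) / up
        a.append((a[l] - down * a[l - 1] - 1) / up)
        b.append((b[l] - down * b[l - 1]) / up)
    t = -a[size] / b[size]          # enforce E_size = 0
    return [a[l] + b[l] * t for l in range(size + 1)]

E = expected_times()
closed_form = 50 - 100 / (Fraction(3, 2) ** 10 + 1)
print(E[10] == closed_form, float(E[10]))
```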


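Remark 1.4.22 A much shorter solution (sometimes adopted by students
who feel it is correct but cannot justify it) is as follows. Let $\tau$ denote the
time of the fall-off. As was found in the solution of Worked Example 1.2.10,
the position $S_\tau$ at time $\tau$ is

$$S_\tau = \begin{cases} 0, & \text{with probability } \dfrac{(3/2)^{10}}{(3/2)^{10} + 1}, \\[2mm] 20, & \text{with probability } \dfrac{1}{(3/2)^{10} + 1}. \end{cases}$$

After n jumps, the position is the sum of independent RVs

$$S_n = S_0 + X_1 + \cdots + X_n,$$

where $S_0$ is the initial position (in our case, equal to 10) and $X_j$ the increment
at the jth jump:

$$X_j = \begin{cases} +1, & \text{with probability } 2/5, \\ -1, & \text{with probability } 3/5. \end{cases}$$

Writing

$$U_n = S_0 + (X_1 - \mathbb{E}X_1) + \cdots + (X_n - \mathbb{E}X_n) = S_n - n\,\mathbb{E}X_1,$$

yields $\mathbb{E}U_n = \mathbb{E}S_0$. The fact (requiring use of the martingale theory) is that
the same is true for the random time $\tau$: $\mathbb{E}U_\tau = \mathbb{E}S_0$. Then

$$10 = 20 \times \frac{1}{(3/2)^{10} + 1} - \mathbb{E}\tau\Bigl(\frac{2}{5} - \frac{3}{5}\Bigr),$$

whence

$$\mathbb{E}\tau = 50 - \frac{100}{(3/2)^{10} + 1},$$

as before.

Worked Example 1.4.23 Define the covariance Cov(X, Y) of two random
variables X and Y. Show that

$$\operatorname{Var}(X + Y) = \operatorname{Var} X + \operatorname{Var} Y + 2\operatorname{Cov}(X, Y).$$

A fair die has two green faces, two red faces and two blue faces, and the die
is thrown once.
Let

$$X = \begin{cases} 1, & \text{if a green face is uppermost,} \\ 0, & \text{otherwise,} \end{cases}
\qquad Y = \begin{cases} 1, & \text{if a blue face is uppermost,} \\ 0, & \text{otherwise.} \end{cases}$$

Find Cov(X, Y).

Solution Set: $\mu_X = \mathbb{E}X$, $\mu_Y = \mathbb{E}Y$, with $\mathbb{E}(X + Y) = \mathbb{E}X + \mathbb{E}Y$. Then

$$\operatorname{Cov}(X, Y) = \mathbb{E}(X - \mu_X)(Y - \mu_Y),$$

and

$$\operatorname{Var}(X + Y) = \mathbb{E}(X + Y - \mu_X - \mu_Y)^2
= \mathbb{E}(X - \mu_X)^2 + \mathbb{E}(Y - \mu_Y)^2 + 2\,\mathbb{E}(X - \mu_X)(Y - \mu_Y)
= \operatorname{Var} X + \operatorname{Var} Y + 2\operatorname{Cov}(X, Y).$$

Now

$$\mathbb{P}(\text{green}) = \mathbb{P}(\text{blue}) = \frac{1}{3},$$

so

$$\mu_X = \mathbb{P}(\text{green}) = \frac{1}{3}; \quad \text{similarly, } \mu_Y = \frac{1}{3}.$$

Finally,

$$\operatorname{Cov}(X, Y) = \mathbb{E}XY - \mu_X\mu_Y.$$

But XY = 0 with probability 1, thus $\mathbb{E}XY = 0$. Hence, Cov(X, Y) = −1/9.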


Worked Example 1.4.24 Three gamblers initially possess fortunes of
sizes a, b and c, respectively. They play a sequence of fair games such that
in each independent round one player selected uniformly at random gives
up a unit of wealth, and another selected uniformly at random receives this
unit. The game is stopped at the moment $\tau$ when one of the gamblers is
ruined. Prove that

$$\mathbb{E}\tau = \frac{3abc}{a + b + c}, \qquad
\operatorname{Var}[\tau] = \frac{3abc}{a + b + c}\Bigl[\frac{ab + bc + ca - 1}{2} - \frac{3abc}{a + b + c}\Bigr].$$

Solution Let $X_n, Y_n, Z_n$ be the fortunes of gamblers after the round n. First
we check that the value $M_n = X_nY_nZ_n + \frac{1}{3}n(a + b + c)$ is a martingale. To this
end, we write the conditional expectation

$$\mathbb{E}(M_{n+1} \mid X_n = x, Y_n = y, Z_n = z)$$

as the following expression:

$$\frac{1}{6}\bigl[(x + 1)(y - 1)z + (x - 1)(y + 1)z + (x - 1)y(z + 1)
+ (x + 1)y(z - 1) + x(y - 1)(z + 1) + x(y + 1)(z - 1)\bigr] + \frac{1}{3}(n + 1)(a + b + c)$$
$$= \frac{1}{6}\bigl(6xyz - 2(x + y + z)\bigr) + \frac{n}{3}(a + b + c) + \frac{1}{3}(a + b + c) = M_n.$$

This martingale can be stopped at the moment $\tau$ (in fact, it is uniformly
bounded). So,

$$abc = \mathbb{E}M_0 = \mathbb{E}M_\tau = 0 + \frac{1}{3}\,\mathbb{E}\tau\,(a + b + c).$$

The derivation of the variance is more involved (cf. [142]). Set s = a + b + c
and check that

$$L_n = X_nY_nZ_n(X_nY_n + Y_nZ_n + Z_nX_n) + 4nX_nY_nZ_n + \frac{2}{3}sn^2 + \frac{1}{3}sn$$

is a martingale. Indeed, the conditional expectation
$\mathbb{E}(L_{n+1} \mid X_n = x, Y_n = y, Z_n = z)$ is written as

$$\frac{1}{6}\sum_{(x', y', z')} \bigl[x'y'z'(x'y' + y'z' + z'x') + 4(n + 1)x'y'z'\bigr] + \frac{2}{3}s(n + 1)^2 + \frac{1}{3}s(n + 1),$$

where the sum runs over the six possible positions after one round:
$(x + 1, y - 1, z)$, $(x - 1, y + 1, z)$, $(x + 1, y, z - 1)$, $(x - 1, y, z + 1)$,
$(x, y + 1, z - 1)$ and $(x, y - 1, z + 1)$.

The RHS equals $L_n$ after a thorough, but straightforward, cancellation. It
is possible to check that this martingale can be stopped at the moment $\tau$
as well. [This martingale is not uniformly bounded but $\mathbb{E}\tau^2 < \infty$, see [142].]
So,

$$abc(ab + bc + ca) = \mathbb{E}L_0 = \mathbb{E}L_\tau = \frac{2}{3}s\,\mathbb{E}\tau^2 + \frac{1}{3}s\,\mathbb{E}\tau.$$

Hence, $\mathbb{E}\tau^2 = \dfrac{3abc(ab + bc + ca - 1)}{2(a + b + c)}$ and the expression for the variance
follows.
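
The two cancellations can be delegated to exact arithmetic. The sketch below checks the one-step martingale identity on a grid of states for xyz + ns/3 and for a second process taken in the form xyz(xy + yz + zx) + 4n·xyz + (2/3)sn^2 + (1/3)sn; this second form is an assumption of the sketch (chosen so that its initial value is abc(ab + bc + ca)), and all function names are ours.

```python
from fractions import Fraction
from itertools import permutations

# the six equally likely (giver, receiver) choices as increments of (x, y, z)
MOVES = list(permutations((1, -1, 0)))

def M(x, y, z, n, s):
    return x * y * z + Fraction(n * s, 3)

def L(x, y, z, n, s):
    # assumed form of the second martingale (see the lead-in)
    return (x * y * z * (x * y + y * z + z * x) + 4 * n * x * y * z
            + Fraction(2 * s * n * n, 3) + Fraction(s * n, 3))

def one_step_mean(f, x, y, z, n):
    """Conditional mean of f after one more round, from state (x, y, z) at time n."""
    s = x + y + z                     # total wealth is conserved
    return sum(f(x + dx, y + dy, z + dz, n + 1, s) for dx, dy, dz in MOVES) / 6

def is_martingale(f, top=6, steps=4):
    return all(one_step_mean(f, x, y, z, n) == f(x, y, z, n, x + y + z)
               for x in range(1, top) for y in range(1, top)
               for z in range(1, top) for n in range(steps))

print(is_martingale(M), is_martingale(L))
```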


Worked Example 1.4.25 A taxi driver moves between four settlements 
W, X, Y, Z located at the edges of a rectangle. Four highways connect- 
ing these settlements run along the sides of this rectangle, dist(W,X) = 
dist(Y, Z) = 5 miles, dist(W, Z) = dist(Y, X) = 10 miles. Having delivered a 
customer the taxi driver waits for the next call. Then they go to the site the 
call came from, collect a customer and go to the new destination. The new 
call emerges from any settlement with probability ; and a customer wants 
to travel to any other settlement with probability 5 Travelling between the 
neighbouring settlements the driver always takes the shortest route. The 
distances within every settlement are very small. Let D be the distance to 
cover in order to collect and deliver a customer. Find the distribution of the 
random variable D. Find ED and Var D. 


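Solution Let the lengths of the sides of the rectangle be a = 5, b = 10, c = 5, d = 10.
Without loss of generality assume that the taxi is at the corner between sides a and
b. Then

$$
D = \begin{cases}
0 + a = 5 & \text{with probability } 1/12,\\
0 + b = 10 & \text{with probability } 1/12,\\
a + a = 10 & \text{with probability } 1/12,\\
a + b = 15 & \text{with probability } 1/4,\\
b + b = 20 & \text{with probability } 1/12,\\
2a + b = 20 & \text{with probability } 1/6,\\
a + 2b = 25 & \text{with probability } 1/6,\\
2a + 2b = 30 & \text{with probability } 1/12.
\end{cases}
$$

So, we have $\mathbb{E}D = 210/12 = 35/2$, $\mathbb{E}D^2 = 4250/12 = 2125/6$ and
$\operatorname{Var} D = \mathbb{E}D^2 - (\mathbb{E}D)^2 = 575/12$.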
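The table above can be cross-checked by brute-force enumeration. The following is a minimal sketch, assuming the taxi starts at corner W and that the corner coordinates (a hypothetical layout chosen only to reproduce the stated side lengths) are as given in `CORNERS`; exact fractions are used so that the mean and variance come out as rationals.

```python
from fractions import Fraction
from itertools import product

# Assumed layout (miles): WX = YZ = 5 and WZ = YX = 10, as in the example.
CORNERS = {"W": (0, 0), "X": (5, 0), "Y": (5, 10), "Z": (0, 10)}


def road_distance(u, v):
    """Shortest route between two corners when driving along the sides."""
    (x1, y1), (x2, y2) = CORNERS[u], CORNERS[v]
    return abs(x1 - x2) + abs(y1 - y2)


def distribution_of_d(start="W"):
    """PMF of D = (distance to the caller) + (length of the ride)."""
    pmf = {}
    for call, dest in product(CORNERS, repeat=2):
        if call == dest:
            continue  # the customer travels to a different settlement
        d = road_distance(start, call) + road_distance(call, dest)
        pmf[d] = pmf.get(d, Fraction(0)) + Fraction(1, 4) * Fraction(1, 3)
    return dict(sorted(pmf.items()))


pmf_d = distribution_of_d()
mean_d = sum(d * p for d, p in pmf_d.items())
var_d = sum(d * d * p for d, p in pmf_d.items()) - mean_d ** 2
print(pmf_d)
print("ED =", mean_d, " Var D =", var_d)
```

Values of D that arise from several routes (for example 10 = 0 + b = a + a) are merged into a single entry of the dictionary.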


Worked Example 1.4.26 I toss two dice and obtain the numbers $1 \le S_1, S_2 \le 6$.
Let $X = S_1 + S_2$ and $Y = S_1 - S_2$.
(a) Let both dice be symmetric, i.e. all the values 1, 2, 3, 4, 5, 6 are
equiprobable. Find the mean values and variances of X and Y. Find the
numbers x and y such that the probabilities $\mathbb{P}(X = x)$ and $\mathbb{P}(Y = y)$ take
their maximum and minimal values. Are the random variables X and Y
independent?

(b) Now suppose that both dice are not symmetric and take the values
1, 2, 3, 4, 5, 6 with probabilities $p_1, \ldots, p_6$ and $q_1, \ldots, q_6$, respectively.
Find $\mathbb{P}(X = 2)$, $\mathbb{P}(X = 7)$ and $\mathbb{P}(X = 12)$. Compare $\mathbb{P}(X = 7)$
and $\sqrt{\mathbb{P}(X = 2)\mathbb{P}(X = 12)}$ and, using the arithmetic mean-geometric mean
(AM-GM) inequality, establish that the probabilities $\mathbb{P}(X = 2), \mathbb{P}(X = 3),
\ldots, \mathbb{P}(X = 12)$ cannot be all equal.


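Solution (a) For $S_1, S_2 \sim \mathrm{U}[1, 6]$ we have $\mathbb{E}S_i = 7/2$, $\operatorname{Var} S_i = 35/12$. Hence,

$$
\mathbb{E}X = \mathbb{E}S_1 + \mathbb{E}S_2 = 7, \quad \operatorname{Var} X = \operatorname{Var} S_1 + \operatorname{Var} S_2 = \frac{35}{6},
$$

using independence in the second equality. Similarly,

$$
\mathbb{E}Y = \mathbb{E}S_1 - \mathbb{E}S_2 = 0, \quad \operatorname{Var} Y = \operatorname{Var} X = \frac{35}{6}.
$$

The maximum value of $\mathbb{P}(X = x)$ is achieved when x = 7: $\mathbb{P}(X = 7) = 1/6$.
The minimal value of $\mathbb{P}(X = x)$ is achieved when x = 2 or x = 12:
$\mathbb{P}(X = 2) = \mathbb{P}(X = 12) = 1/36$. Similarly, the maximum value of $\mathbb{P}(Y = y)$
is achieved when y = 0: $\mathbb{P}(Y = 0) = 1/6$. The minimal value of $\mathbb{P}(Y = y)$ is
achieved when $y = \pm 5$: $\mathbb{P}(Y = -5) = \mathbb{P}(Y = 5) = 1/36$.

The random variables X and Y are dependent since

$$
\mathbb{P}(X = 12, Y = 0) \ne \mathbb{P}(X = 12)\mathbb{P}(Y = 0).
$$

(b) We have

$$
\mathbb{P}(X = 2) = p_1 q_1, \quad \mathbb{P}(X = 12) = p_6 q_6,
$$
$$
\mathbb{P}(X = 7) = p_1 q_6 + p_2 q_5 + p_3 q_4 + p_4 q_3 + p_5 q_2 + p_6 q_1.
$$

If we suppose $X \sim \mathrm{U}[2, 12]$ then

$$
p_1 q_1 = p_6 q_6 = \frac{1}{11}, \quad \mathbb{P}(X = 7) \ge p_1 q_6 + p_6 q_1.
$$

Hence, $\mathbb{P}(X = 7) \ge 2\sqrt{p_1 q_6 p_6 q_1}$ by the AM-GM inequality. By a
straightforward rearrangement we obtain a contradiction:

$$
\frac{1}{11} = \mathbb{P}(X = 7) \ge 2\sqrt{p_1 q_1 p_6 q_6} = 2\sqrt{\mathbb{P}(X = 2)}\sqrt{\mathbb{P}(X = 12)} = \frac{2}{11}.
$$

So, X could not be U[2, 12].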
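Part (a) involves only 36 equiprobable outcomes, so every quantity in it can be confirmed by direct enumeration. The sketch below is one way to do this; the helper name `mean_and_variance` is introduced here purely for illustration.

```python
from fractions import Fraction
from itertools import product

WEIGHT = Fraction(1, 36)  # probability of each ordered pair (s1, s2) of fair dice
pmf_x, pmf_y, joint = {}, {}, {}
for s1, s2 in product(range(1, 7), repeat=2):
    x, y = s1 + s2, s1 - s2
    pmf_x[x] = pmf_x.get(x, 0) + WEIGHT
    pmf_y[y] = pmf_y.get(y, 0) + WEIGHT
    joint[(x, y)] = joint.get((x, y), 0) + WEIGHT


def mean_and_variance(pmf):
    """Return (mean, variance) of a finite PMF given as {value: probability}."""
    m = sum(v * p for v, p in pmf.items())
    return m, sum(v * v * p for v, p in pmf.items()) - m ** 2


print(mean_and_variance(pmf_x), mean_and_variance(pmf_y))
print("argmax:", max(pmf_x, key=pmf_x.get), max(pmf_y, key=pmf_y.get))
# Dependence: the joint probability differs from the product of the marginals.
print(joint[(12, 0)], pmf_x[12] * pmf_y[0])
```

A single pair (x, y) where the joint probability and the product of marginals disagree is enough to rule out independence; the pair (12, 0) is the one used in the solution.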


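Worked Example 1.4.27 Let X be an integer-valued RV with distribution

$$
\mathbb{P}(X = n) = \frac{n^{-s}}{\zeta(s)}, \quad n = 1, 2, \ldots,
$$

where s > 1, and

$$
\zeta(s) = \sum_{n \ge 1} n^{-s}.
$$

Let $1 < p_1 < p_2 < p_3 < \cdots$ be the primes and let $A_k$ be the event
{X is divisible by $p_k$}. Find $\mathbb{P}(A_k)$ and show that the events $A_1, A_2, \ldots$
are independent. Deduce that

$$
\prod_{k \ge 1} \left(1 - p_k^{-s}\right) = \frac{1}{\zeta(s)}.
$$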

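Solution Write

$$
\mathbb{P}(A_k) = \sum_{n:\ \text{divisible by } p_k} \frac{n^{-s}}{\zeta(s)}
= \sum_{l \ge 1} \frac{(p_k l)^{-s}}{\zeta(s)} = p_k^{-s}.
$$

Similarly, for all collections $A_{k_1}, \ldots, A_{k_i}$ ($1 \le k_1 < \cdots < k_i$):

$$
\mathbb{P}\left(A_{k_1} \cap \cdots \cap A_{k_i}\right) = p_{k_1}^{-s} \cdots p_{k_i}^{-s} = \prod_{j=1}^{i} \mathbb{P}\left(A_{k_j}\right),
$$

i.e. the events $A_1, A_2, \ldots$ are independent.

Finally, $1 - p_k^{-s}$ is the probability that X is not divisible by $p_k$. Then
$\prod_{k \ge 1} (1 - p_k^{-s})$ gives the probability that X is not divisible by any prime,
i.e. X = 1. The last probability equals $1/\zeta(s)$.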
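The product formula can be sanity-checked numerically. The sketch below is an illustration only: it truncates both the product over primes and the series for the zeta-function (the cut-offs `limit` and `terms` are arbitrary choices), so agreement is approximate.

```python
import math


def primes_up_to(limit):
    """Sieve of Eratosthenes returning all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for m in range(n * n, limit + 1, n):
                is_prime[m] = False
    return [n for n, flag in enumerate(is_prime) if flag]


def truncated_euler_product(s, limit=10_000):
    """Product of (1 - p^(-s)) over primes p <= limit."""
    result = 1.0
    for p in primes_up_to(limit):
        result *= 1.0 - p ** (-s)
    return result


def truncated_zeta(s, terms=200_000):
    """Partial sum of the series defining zeta(s)."""
    return sum(n ** (-s) for n in range(1, terms + 1))


s = 2.0
print(truncated_euler_product(s), 1.0 / truncated_zeta(s), 6.0 / math.pi ** 2)
```

For s = 2 the exact value of 1/zeta(2) is 6/pi^2, which gives an independent reference point for both truncations.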


Remark 1.4.28 The quantity $\zeta(s)$ is the famous Riemann zeta-function
that is the subject of the Riemann hypothesis, one of the few problems posed 
in the nineteenth century that still remains unsolved. The above problem 
gives a representation of the zeta-function as an infinite product over the 
prime numbers. 

The name of G.F.B. Riemann (1826-1866), the remarkable German math- 
ematician, is also related to geometry (Riemannian manifolds, Riemannian 
metric). In the context of this book, we speak of Riemann integration, as 
opposed to a more general Lebesgue integration; see the following pages. 
1.5 The binomial, Poisson and geometric distributions.
Probability generating, moment generating and 
characteristic functions 


Life is good for only two things, 
discovering mathematics and teaching mathematics. 
S.-D. Poisson (1781-1840), French mathematician. 


The binomial distribution appears naturally in the coin-toss setting. Con- 
sider the random variable X equal to the number of heads shown in the 
course of n trials with the same type of coin. A convenient representation is 


$$
X = Y_1 + \cdots + Y_n,
$$

where RVs $Y_1, \ldots, Y_n$ are IID:

$$
Y_j = \begin{cases}
1, & \text{if the } j\text{th trial shows a head},\\
0, & \text{if the } j\text{th trial shows a tail}.
\end{cases}
$$

Assuming that $\mathbb{P}(Y_j = 1)$, the probability of a head, equals p and $\mathbb{P}(Y_j = 0)$,
that of a tail, q = 1 − p, we have

$$
\mathbb{P}(X = k) = \binom{n}{k} p^k q^{n-k}, \quad 0 \le k \le n. \tag{1.5.1}
$$

This probability distribution (and the RV itself) is called binomial (or (n, p)-binomial)
because it is related to the binomial expansion

$$
\sum_{0 \le k \le n} \binom{n}{k} p^k q^{n-k} = (p + q)^n = 1.
$$

Values of binomial probabilities are shown in Figure 1.3.

Because of the above representation $X = Y_1 + \cdots + Y_n$, the sum $X + X'$
of independent (n, p)- and (n', p)-binomial RVs X and X' is ((n + n'), p)-binomial.
This representation also yields that

$$
\mathbb{E}X = n\mathbb{E}Y_1 = np, \quad \operatorname{Var} X = n \operatorname{Var} Y_1 = npq. \tag{1.5.2}
$$


We write X ~ Bin(n,p) for a binomial RV X. 

Other well-known expansions also give rise to useful probability distributions.
For example, if we toss a coin until the first head, the outcomes
are numbers 0, 1, ... (indicating the number of tails before the first head
was shown). Let X denote the number of tails before the first head. The
probability of outcome k is

$$
\mathbb{P}(X = k) = p q^k, \quad k = 0, 1, 2, \ldots, \tag{1.5.3}
$$

and equals the probability of the sequence TT...TH where the first k symbols are
T and the (k + 1)th one is H. The sum of all probabilities (proportional to
the sum of a geometric progression) is p/(1 − q) = 1. The 'tail' probability is
$\mathbb{P}(X \ge k) = q^k$, $k \ge 0$.

Not surprisingly, the RV X giving the number of trials before the first head
is said to have a geometric distribution (or to be geometric), with parameter
q. The diagram of values of geometric probabilities is shown in Figure 1.4.

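The expectation and the variance of a geometric distribution are calculated below:

$$
\mathbb{E}X = \sum_{k \ge 1} (1 - q) k q^k = pq \sum_{k \ge 1} k q^{k-1} = \frac{pq}{(1-q)^2} = \frac{q}{p}, \tag{1.5.4}
$$

and

$$
\begin{aligned}
\operatorname{Var} X &= \mathbb{E}X^2 - (\mathbb{E}X)^2 = \sum_{k \ge 1} k^2 p q^k - \frac{q^2}{p^2}\\
&= p q^2 \sum_{k \ge 2} k(k-1) q^{k-2} + pq \sum_{k \ge 1} k q^{k-1} - \frac{q^2}{p^2}\\
&= p q^2 \frac{\mathrm{d}^2}{\mathrm{d}q^2} \frac{1}{1-q} + \frac{q}{p} - \frac{q^2}{p^2}
= \frac{2q^2}{p^2} + \frac{q}{p} - \frac{q^2}{p^2} = \frac{q}{p^2}.
\end{aligned} \tag{1.5.5}
$$

Figure 1.3 The binomial PMFs (legend: n = 14 with p = 0.2, p = 0.5 and p = 0.7).

Figure 1.4 The geometric PMF.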


Note a special property of a geometric RV: for all m < n:

$$
\mathbb{P}(X \ge n \mid X \ge m) = \mathbb{P}(X \ge n - m); \tag{1.5.6}
$$

this is called the memoryless property. Another property of geometric
distributions is that the minimum min [X, X'] of two independent geometric
RVs X and X' with parameters q and q' is geometric, with parameter qq':

$$
\mathbb{P}(\min[X, X'] \ge k) = (qq')^k, \quad k = 0, 1, \ldots.
$$

Sometimes, the geometric distribution is defined by probabilities
$p_k = p q^{k-1}$, for k = 1, 2, ..., which counts the number of trials up to (and
including) the first head. This leads to a different value of the mean, 1/p; the
variance remains the same. We write X ~ Geom (q) for a geometric RV X
(in either definition).

The geometric distribution often arises in various situations. For exam- 
ple, the number of hops made by a bird before flying fits the geometric 
distribution. 

A generalisation of the geometric distribution is a negative binomial
distribution. Here, the corresponding RV X gives the number of tails before
the rth head. In other words, $X = X_1 + \cdots + X_r$, where $X_i \sim$ Geom (q),
independently. A direct calculation shows that

$$
\mathbb{P}(X = k) = \binom{r + k - 1}{k} p^r q^k, \quad k = 0, 1, 2, \ldots. \tag{1.5.7}
$$

In fact, X = k means that there were altogether k + r trials, with k tails
and r heads, and the last toss was a head.
We write X ~ NegBin(q, r) for the negative binomial distribution.

Figure 1.5 The Poisson PMFs (legend: lambda = 0.1, lambda = 1 and lambda = 2).


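Another example is the Poisson distribution, with parameter $\lambda > 0$; it
emerges from the expansion $e^{\lambda} = \sum_{k \ge 0} \lambda^k / k!$ and is named after S.-D.
Poisson (1781-1840), a prominent French mathematician and mathematical
physicist. Here we again assign probabilities to non-negative integers, and
the probability assigned to k equals $\lambda^k e^{-\lambda}/k!$. An RV X with

$$
\mathbb{P}(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, 2, \ldots, \tag{1.5.8}
$$

is called Poisson. These probabilities arise from the binomial probabilities
in the limit $n \to \infty$, with $p = \lambda/n \to 0$:

$$
\frac{n!}{(n-k)!\,k!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}
= \frac{n(n-1)\cdots(n-k+1)}{n^k} \times \frac{\lambda^k}{k!} \times \frac{(1 - \lambda/n)^n}{(1 - \lambda/n)^k},
$$

which approaches $\lambda^k e^{-\lambda}/k!$ as $n \to \infty$.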

The last observation explains the fact that the sum X + X' of two independent
Poisson RVs X and X', with parameters $\lambda$ and $\lambda'$, is Poisson, with
parameter $\lambda + \lambda'$. This fact can also be established directly.

The graphs of values of Poisson probabilities are in Figure 1.5. 

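The expectation $\mathbb{E}X$ and the variance $\operatorname{Var} X$ of a Poisson RV equal $\lambda$:

$$
\mathbb{E}X = \sum_{k \ge 0} k \frac{\lambda^k}{k!} e^{-\lambda} = \lambda, \quad
\operatorname{Var} X = \sum_{k \ge 0} k^2 \frac{\lambda^k}{k!} e^{-\lambda} - \lambda^2 = \lambda. \tag{1.5.9}
$$

We write X ~ Po($\lambda$) for a Poisson RV X.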
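The passage from binomial to Poisson probabilities can be observed numerically. The sketch below compares the two PMFs for a fixed parameter (the value 2.0 and the range k = 0, ..., 10 are illustrative choices, not taken from the text) as n grows; the helper `max_gap` is a hypothetical name.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def max_gap(n, lam, k_max=10):
    """Largest pointwise difference between Bin(n, lam/n) and Po(lam) for k <= k_max."""
    return max(
        abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
        for k in range(k_max + 1)
    )


lam = 2.0
for n in (10, 100, 10_000):
    print(n, max_gap(n, lam))

# Truncated sums for the Poisson mean and variance; both should be close to lam.
mean = sum(k * poisson_pmf(k, lam) for k in range(60))
variance = sum(k * k * poisson_pmf(k, lam) for k in range(60)) - mean ** 2
print(mean, variance)
```

The printed gaps shrink as n increases, in line with the limit above, and the truncated sums reproduce the mean and variance of the Poisson distribution up to rounding.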
The Poisson distribution is widely used in various situations. A famous
(albeit chilling) example (with which both authors of this book began their 
studies in Probability) is the number of Prussian cavalrymen killed by a 
horse kick in each of the 16 corps in each of the years 1875-1894. This can be 
perfectly fitted by the Poisson distribution! It is amazing that this example 
found its way into most textbooks until the end of the twentieth century, 
without diminishing the enthusiasm of several generations of students. 

The probability generating function (PGF) $\phi(s)$ ($= \phi_X(s)$) of an RV X
taking finitely or countably many non-negative integer values n with
probabilities $p_n$ is defined as

$$
\phi_X(s) = \sum_{n} p_n s^n = \sum_{\omega \in \Omega} p(\omega) s^{X(\omega)} = \mathbb{E} s^X; \tag{1.5.10}
$$

it is usually considered for $-1 \le s \le 1$ to guarantee convergence. A list of
PGFs can be found in the Appendix, Table A.1; here we give a few often
used examples. If X takes values 1 and 0 with probabilities p and 1 − p, then

$$
\phi_X(s) = ps + 1 - p; \tag{1.5.11}
$$

if X ~ Bin(n, p), then

$$
\phi_X(s) = \sum_{0 \le k \le n} \binom{n}{k} p^k (1-p)^{n-k} s^k = [ps + (1 - p)]^n; \tag{1.5.12}
$$

if X ~ Geom (p), then

$$
\phi_X(s) = \sum_{k \ge 0} p (1-p)^k s^k = \frac{p}{1 - s(1-p)}; \tag{1.5.13}
$$

and if X ~ Po($\lambda$), then

$$
\phi_X(s) = \sum_{k \ge 0} \frac{\lambda^k}{k!} e^{-\lambda} s^k = e^{\lambda(s-1)}. \tag{1.5.14}
$$


An important fact is that the PGF $\phi_X(s)$ determines the distribution of
RV X uniquely: if $\phi_X(s) = \phi_Y(s)$ for all 0 < s < 1, then $\mathbb{P}(X = n) = \mathbb{P}(Y = n)$
for all n. In this case we write X ~ Y. For X and Y taking finitely many
values the uniqueness is obvious as $\phi_X(s)$ and $\phi_Y(s)$ are linear combinations
of monomials $s^n$ (i.e. polynomials); if two polynomials coincide, their
coefficients also coincide. This is also true for RVs taking countably many values,
but here one has to deal with power series (i.e. the Taylor decomposition).

Also,

$$
\mathbb{E}X = \left.\frac{\mathrm{d}}{\mathrm{d}s} \phi_X(s)\right|_{s=1} \quad \text{and} \quad
\mathbb{E}X(X - 1) = \left.\frac{\mathrm{d}^2}{\mathrm{d}s^2} \phi_X(s)\right|_{s=1}, \tag{1.5.15}
$$

implying that

$$
\operatorname{Var} X = \mathbb{E}X(X - 1) + \mathbb{E}X - (\mathbb{E}X)^2
= \left.\left[\frac{\mathrm{d}^2}{\mathrm{d}s^2} \phi_X(s) + \frac{\mathrm{d}}{\mathrm{d}s} \phi_X(s) - \left(\frac{\mathrm{d}}{\mathrm{d}s} \phi_X(s)\right)^2\right]\right|_{s=1}. \tag{1.5.16}
$$

Another useful observation is that RVs X and Y are independent if and
only if

$$
\phi_{X+Y}(s) = \phi_X(s)\phi_Y(s). \tag{1.5.17}
$$

In fact, let X, Y be independent. Then by convolution formula (1.4.2),
$\phi_{X+Y}(s)$ equals

$$
\sum_{n} \sum_{k=0}^{n} \mathbb{P}(X = k) s^k\, \mathbb{P}(Y = n - k) s^{n-k},
$$

and by changing $n - k \mapsto l$ is equal to

$$
\left[\sum_{k} \mathbb{P}(X = k) s^k\right] \left[\sum_{l} \mathbb{P}(Y = l) s^l\right] = \phi_X(s)\phi_Y(s).
$$

The converse fact is proved by referring to the uniqueness of the coefficients
of a power series. A short proof is achieved by observing that if X and Y
are independent then $s^X$ and $s^Y$ are independent, and hence by virtue of
formula (1.4.15):

$$
\mathbb{E} s^{X+Y} = \mathbb{E} s^X s^Y = \mathbb{E} s^X\, \mathbb{E} s^Y.
$$


The use of PGFs is illustrated in the following example. 


Example 1.5.1 There is a random number N of birds on a rock; each
bird is a seagull with probability p and a puffin with probability 1 − p. If S
is the number of seagulls then the PGFs $\phi_N(s)$ and $\phi_S(s)$ are related by

$$
\phi_S(s) = \phi_N(ps + 1 - p).
$$

In fact, according to the definition, $\phi_S(s) = \mathbb{E} s^S = \sum_n s^n P_S(n)$, with
$P_S(n) = \mathbb{P}(S = n)$, and similarly for $\phi_N(s)$. Then

$$
\begin{aligned}
\phi_S(s) &= \sum_{n} s^n \sum_{m \ge n} P_N(m) \binom{m}{n} p^n (1-p)^{m-n}\\
&= \sum_{m} P_N(m) \sum_{n \le m} \binom{m}{n} (sp)^n (1-p)^{m-n}\\
&= \sum_{m} P_N(m) \left(ps + (1-p)\right)^m = \phi_N\left(ps + (1-p)\right).
\end{aligned}
$$

In short-hand notation:

$$
\phi_S(s) = \mathbb{E}\left[\mathbb{E}(s^S \mid N)\right] = \sum_{m} P_N(m)\, \mathbb{E}(s^S \mid N = m)
= \sum_{m} P_N(m) \left(ps + (1-p)\right)^m = \phi_N[ps + (1-p)].
$$

Similarly, for the number of puffins,

$$
\phi_{N-S}(s) = \phi_N[(1-p)s + p].
$$

Now, suppose N has the Poisson distribution with parameter $\mu$. Then

$$
\phi_N(s) = e^{\mu(s-1)} \quad \text{and} \quad \phi_S(s) = e^{\mu(ps + 1 - p - 1)} = e^{\mu p(s-1)}.
$$

By the uniqueness of the probability distribution with a given PGF,
S ~ Po($\mu p$). Similarly,

$$
\phi_{N-S}(s) = e^{\mu(1-p)(s-1)}, \quad \text{i.e. } N - S \sim \mathrm{Po}[\mu(1-p)].
$$

Also,

$$
\phi_N(s) = \phi_S(s)\phi_{N-S}(s),
$$

i.e. S and N − S are independent.
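Of course, one could proceed directly to prove that S ~ Po($p\mu$), N − S ~
Po[$(1-p)\mu$] and S and N − S are independent. First,

$$
\begin{aligned}
\mathbb{P}(S = k) &= \sum_{n \ge k} \mathbb{P}(S = k \mid N = n)\mathbb{P}(N = n)
= \sum_{n \ge k} \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k} \frac{\mu^n e^{-\mu}}{n!}\\
&= \frac{(p\mu)^k e^{-p\mu}}{k!} \sum_{n-k \ge 0} \frac{e^{-\mu(1-p)} [\mu(1-p)]^{n-k}}{(n-k)!}
= \frac{(p\mu)^k e^{-p\mu}}{k!},
\end{aligned}
$$

and similarly $\mathbb{P}(N - S = k) = [(1-p)\mu]^k e^{-(1-p)\mu}/k!$. Next,

$$
\begin{aligned}
\mathbb{P}(S = k, N - S = l) &= \mathbb{P}(S = k \mid N = k + l)\mathbb{P}(N = k + l)\\
&= \binom{k+l}{k} p^k (1-p)^l \frac{\mu^{k+l} e^{-\mu}}{(k+l)!}\\
&= \frac{(p\mu)^k e^{-p\mu}}{k!} \cdot \frac{((1-p)\mu)^l e^{-\mu(1-p)}}{l!},
\end{aligned}
$$

i.e. S and N − S are independent.
An elegant representation for S and N − S is:

$$
S = \sum_{i=1}^{N} X_i, \quad N - S = \sum_{i=1}^{N} (1 - X_i),
$$

where $X_1, X_2, \ldots$ are IID RVs taking values 1 and 0 with probabilities p
and (1 − p), respectively. It is instructive to repeat the above proof that
S ~ Po($p\mu$) and (N − S) ~ Po[$(1-p)\mu$] by using this representation (and
assuming of course that N ~ Po($\mu$)).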
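The factorisation of the joint distribution of S and N − S can be checked numerically. The sketch below is an illustration under assumed parameter values (mu = 3 and p = 0.4 are arbitrary, not from the text); it compares the joint probability computed through N with the product of the two Poisson probabilities.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def joint_pmf(k, l, mu, p):
    """P(S = k, N - S = l) when N ~ Po(mu) and each bird is a seagull w.p. p."""
    n = k + l
    return math.comb(n, k) * p ** k * (1.0 - p) ** l * poisson_pmf(n, mu)


mu, p = 3.0, 0.4  # illustrative values only
worst = max(
    abs(joint_pmf(k, l, mu, p)
        - poisson_pmf(k, mu * p) * poisson_pmf(l, mu * (1.0 - p)))
    for k in range(8)
    for l in range(8)
)
print("largest discrepancy:", worst)
```

The discrepancy is at the level of floating-point rounding, as expected from the identity derived above.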

An interesting fact is that a converse assertion also holds: here the use
of PGFs will be very efficient. Let N be an arbitrary RV, let p = 1/2 and
suppose that RVs S and N − S are independent, without assuming that
either of them is Poisson. By independence, symmetry and the equation
$\phi_S(s) = \phi_N(ps + (1-p))$:

$$
\phi_N(s) = \phi_S(s)\phi_{N-S}(s) = \left[\phi_N\left(\frac{1}{2}s + \frac{1}{2}\right)\right]^2.
$$

Then the function $\Phi(s) = \phi_N(1 - s)$ satisfies

$$
\Phi(s) = \left[\Phi\left(\frac{s}{2}\right)\right]^2 = \cdots = \left[\Phi\left(\frac{s}{2^n}\right)\right]^{2^n}.
$$

Thus, if s is small, the Taylor expansion yields

$$
\Phi(s) = \phi_N(1 - s) = \phi_N(1) - \phi_N'(1)s + O(s^2) = 1 - \mu s + O(s^2),
$$

where $\mu = \mathbb{E}N$. Hence, as $n \to \infty$,

$$
\Phi(s) = \left[1 - \frac{\mu s}{2^n} + O\left(\frac{s^2}{2^{2n}}\right)\right]^{2^n} \to e^{-\mu s}.
$$

Thus, $\phi_N(1 - s) = e^{-\mu s}$, i.e. N ~ Po($\mu$). Then, by the above argument,
S and N − S will be Poisson of parameter $\mu/2$.


The function

$$
M_X(\theta) = \phi_X(e^{\theta}) = \mathbb{E} e^{\theta X} \tag{1.5.18}
$$

is called the moment generating function (MGF). It is considered for real
values of the argument $\theta$, but may not exist for some of them. If X takes
non-negative values only, $M_X(\theta)$ exists for all $\theta \le 0$ and possibly for some
$\theta > 0$. Then one can also use the function

$$
L_X(\theta) = \phi_X(e^{-\theta}) = \mathbb{E} e^{-\theta X}, \tag{1.5.19}
$$

which is called the Laplace transform. The name 'moment generating function'
comes from the representation $M_X(\theta) = \sum_{n \ge 0} (\mathbb{E}X^n)\, \theta^n / n!$, as the
expected value $\mathbb{E}X^n$ is called the nth moment of RV X.

On the other hand, the characteristic function (CHF) $\psi(t)$ ($= \psi_X(t)$) can
be correctly defined for all real t:

$$
\psi(t) = \mathbb{E} e^{itX} = \sum_{\omega \in \Omega} p(\omega) e^{itX(\omega)}, \tag{1.5.20}
$$

though it takes complex values, of modulus $\le 1$. The usefulness of these two
functions can be seen from the following properties.


(i) The MGFs and CHFs can be effectively used for RVs taking real values,
not necessarily non-negative integers. For many important RVs the
MGFs and CHFs are written in a convenient and transparent form. The
CHFs are particularly important when one analyses the convergence of
RVs. Of course, $\psi_X(t) = M_X(it)$ and $M_X(\theta) = \phi_X(e^{\theta})$.

(ii) The MGFs and CHFs multiply when we add independent RVs: if X
and Y are independent then, by equation (1.4.15),

$$
M_{X+Y}(\theta) = M_X(\theta)M_Y(\theta), \quad \psi_{X+Y}(t) = \psi_X(t)\psi_Y(t). \tag{1.5.21}
$$

In fact, if X and Y are independent then $e^{\theta X}$ and $e^{\theta Y}$ are independent, and

$$
M_{X+Y}(\theta) = \mathbb{E} e^{\theta(X+Y)} = \mathbb{E} e^{\theta X} e^{\theta Y} = \mathbb{E} e^{\theta X}\, \mathbb{E} e^{\theta Y} = M_X(\theta)M_Y(\theta),
$$

and similarly for the characteristic functions.

(iii) The expected value $\mathbb{E}X$ and the variance $\operatorname{Var} X$ (and in fact other
moments $\mathbb{E}X^j$) are expressed in terms of derivatives at $\theta = 0$ and t = 0, viz.

$$
\mathbb{E}X = \left.\frac{\mathrm{d}}{\mathrm{d}\theta} M_X(\theta)\right|_{\theta=0}
= \left.\frac{1}{i}\frac{\mathrm{d}}{\mathrm{d}t} \psi_X(t)\right|_{t=0}, \tag{1.5.22}
$$

$$
\operatorname{Var} X = \left.\left[\frac{\mathrm{d}^2}{\mathrm{d}\theta^2} M_X(\theta) - \left(\frac{\mathrm{d}}{\mathrm{d}\theta} M_X(\theta)\right)^2\right]\right|_{\theta=0}
= \left.\left[-\frac{\mathrm{d}^2}{\mathrm{d}t^2} \psi_X(t) + \left(\frac{\mathrm{d}}{\mathrm{d}t} \psi_X(t)\right)^2\right]\right|_{t=0}. \tag{1.5.23}
$$

(iv) Each of $M_X(\theta)$ and $\psi_X(t)$ uniquely defines the distribution of the RV:
if $M_X(\theta) = M_Y(\theta)$ or $\psi_X(t) = \psi_Y(t)$ in the entire domain of existence
then variables X and Y take the same collection of values and with
the same probabilities. In this case we write, as before, X ~ Y. This
property also ensures that if $M_{X+Y}(\theta) = M_X(\theta)M_Y(\theta)$ or $\psi_{X+Y}(t) =
\psi_X(t)\psi_Y(t)$ then RVs X and Y are independent.

Property (iv) is helpful for identifying various probability distributions;
its full proof is beyond the limits of this volume.


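Worked Example 1.5.2 Let N be a discrete random variable with

$$
\mathbb{P}(N = k) = (1 - p)^{k-1} p, \quad k = 1, 2, 3, \ldots,
$$

where 0 < p < 1. Show that $\mathbb{E}N = 1/p$.

Solution A direct calculation:

$$
\mathbb{E}N = \sum_{k \ge 1} k (1 - p)^{k-1} p
= p \left.\frac{\mathrm{d}}{\mathrm{d}x} \frac{1}{1 - x}\right|_{x = 1-p}
= \frac{p}{p^2} = \frac{1}{p}.
$$

Alternatively: $\mathbb{E}N = 1 \times p + (1 - p) \times (1 + \mathbb{E}N)$, implying that $\mathbb{E}N = 1/p$.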


Geometric always keeps distance from Poisson. 
Because they are both discrete. 
(From the series ‘Why they are misunderstood’. ) 


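Worked Example 1.5.3 (i) Suppose that the RV X is geometrically
distributed, so that $\mathbb{P}(X = n) = p q^n$, n = 0, 1, ..., where q = 1 − p and
0 < p < 1. Show that

$$
\mathbb{E}\left(\frac{1}{X + 1}\right) = -\frac{p}{q} \ln p.
$$

(ii) Two dice are repeatedly thrown until neither of them shows the number
6. Find the probability that at the final throw at least one of the dice shows
the number 5.

Suppose that your gain is the total Y shown by the two dice at the last
throw, and that the time taken to achieve it is the total number N of throws.

(a) Find your expected gain $\mathbb{E}Y$.

(b) Find your expected rate of return $\mathbb{E}(Y/N)$, using the approximation
$\ln(36/25) \approx 11/30$.

Solution (i) We have

$$
\mathbb{E}\left(\frac{1}{X + 1}\right) = \sum_{n \ge 0} \frac{p q^n}{n + 1}
= \frac{p}{q} \sum_{n \ge 1} \frac{q^n}{n}
= \frac{p}{q} \int_0^q \frac{\mathrm{d}x}{1 - x}
= -\frac{p}{q} \ln(1 - q) = -\frac{p}{q} \ln p.
$$

(ii) By symmetry,

$$
\mathbb{P}(\text{at least one 5 at the last throw})
= \frac{\mathbb{P}(\text{at least one 5 and no 6 at the last throw})}{\mathbb{P}(\text{no 6 at the last throw})} = \frac{9}{25}.
$$

Next, $Y = Y_1 + Y_2$, where $Y_i$ is the number shown by die i. Hence, (a)

$$
\mathbb{E}Y = 2\mathbb{E}Y_1, \quad \mathbb{E}Y_1 = (1 + 2 + 3 + 4 + 5)/5 = 3, \quad \mathbb{E}Y = 6.
$$

Now, N ~ Geom (q), with parameter q = 11/36. Furthermore, Y and N are
independent:

$$
\mathbb{P}(Y = y, N = n) = q^{n-1} p\, \mathbb{P}(\text{total } y \text{ by the } n\text{th throw} \mid \text{no 6})
= \mathbb{P}(Y = y)\mathbb{P}(N = n).
$$

So, (b) observe that N = X + 1, where X is as in (i) with probability of success
p = 25/36. Then

$$
\mathbb{E}\left(\frac{Y}{N}\right) = \mathbb{E}Y\, \mathbb{E}\left(\frac{1}{N}\right)
= 6 \times \left(-\frac{p}{q} \ln p\right) = 6 \times \frac{25}{11} \ln\frac{36}{25}
\approx 6 \times \frac{25}{11} \times \frac{11}{30} = 5.
$$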
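The expected rate of return E(Y/N) = EY E(1/N) can be evaluated numerically and compared with the value obtained from the approximation ln(36/25) ≈ 11/30. The sketch below assumes, as in the solution, that Y and N are independent and that N takes values 1, 2, ... with probabilities p q^(n-1); the truncation of the series at 200 terms is an arbitrary choice.

```python
import math

p = 25 / 36  # probability that neither die shows a 6 on a given throw
q = 1 - p

# E(1/N) for P(N = n) = p q^(n-1): truncated series versus the closed form.
series = sum(p * q ** (n - 1) / n for n in range(1, 200))
closed_form = -(p / q) * math.log(p)

expected_gain = 2 * sum(range(1, 6)) / 5           # EY = 2 * 3 = 6
rate_exact = expected_gain * closed_form           # EY * E(1/N)
rate_approx = expected_gain * (p / q) * (11 / 30)  # using ln(36/25) ~ 11/30
print(series, closed_form)
print(rate_exact, rate_approx)
```

The approximate value is 5, while the value computed with the true logarithm is slightly smaller, which shows how close the suggested approximation is.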


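Worked Example 1.5.4 (i) Suppose that X and Y are discrete random
variables with finite mean and variance. Establish the following results:

(a) $\mathbb{E}(X + Y) = \mathbb{E}X + \mathbb{E}Y$.
(b) $\operatorname{Var}(X + Y) = \operatorname{Var} X + \operatorname{Var} Y + 2\operatorname{Cov}(X, Y)$.
(c) X and Y independent implies that $\operatorname{Cov}(X, Y) = 0$.

(ii) A coin shows heads with probability p > 0 and tails with probability
q = 1 − p. Let $X_n$ be the number of tosses needed to obtain n heads. Find
the PGF for $X_1$ and compute its mean and variance. What is the mean and
variance of $X_n$?

Solution (i) (a) Write

$$
\mathbb{E}X = \sum_{a} a\, \mathbb{P}(X = a), \quad \mathbb{E}Y = \sum_{b} b\, \mathbb{P}(Y = b)
$$

and observe that

$$
\{\omega : X = a\} = \bigcup_{b} \{\omega : X = a, Y = b\},
$$

where the union is pair-wise disjoint. Then

$$
\begin{aligned}
\mathbb{E}X + \mathbb{E}Y &= \sum_{a,b} \left[a\, \mathbb{P}(X = a, Y = b) + b\, \mathbb{P}(X = a, Y = b)\right]\\
&= \sum_{c} \sum_{a+b=c} (a + b)\, \mathbb{P}(X = a, Y = b)\\
&= \sum_{c} c\, \mathbb{P}(X + Y = c) = \mathbb{E}(X + Y).
\end{aligned}
$$

(b) By definition, $\operatorname{Var}(X) = \mathbb{E}((X - \mathbb{E}X)^2) = \mathbb{E}X^2 - (\mathbb{E}X)^2$. Then

$$
\begin{aligned}
\operatorname{Var}(X + Y) &= \mathbb{E}(X + Y - (\mathbb{E}X + \mathbb{E}Y))^2\\
&= \mathbb{E}(X - \mathbb{E}X)^2 + \mathbb{E}(Y - \mathbb{E}Y)^2 + 2\mathbb{E}(X - \mathbb{E}X)(Y - \mathbb{E}Y)\\
&= \operatorname{Var} X + \operatorname{Var} Y + 2\operatorname{Cov}(X, Y).
\end{aligned}
$$

(c) If X, Y are independent,

$$
\begin{aligned}
\mathbb{E}XY &= \sum_{c} c\, \mathbb{P}(XY = c) = \sum_{c} c \sum_{ab=c} \mathbb{P}(X = a, Y = b)\\
&= \sum_{a,b} ab\, \mathbb{P}(X = a)\mathbb{P}(Y = b)\\
&= \left[\sum_{a} a\, \mathbb{P}(X = a)\right] \left[\sum_{b} b\, \mathbb{P}(Y = b)\right] = \mathbb{E}X\, \mathbb{E}Y.
\end{aligned}
$$

Now if X, Y are independent then so are $X - \mathbb{E}X$, $Y - \mathbb{E}Y$, and so
$\operatorname{Cov}(X, Y) = \mathbb{E}(X - \mathbb{E}X)(Y - \mathbb{E}Y) = 0$.

(ii) We have $\mathbb{P}(X_1 = k) = p q^{k-1}$, and so the PGF $\phi(s) = \phi_{X_1}(s) = ps/(1 - qs)$
and the derivative $\phi'(s) = p/(1 - qs)^2$. Hence,

$$
\mathbb{E}X_1 = \phi'(1) = \frac{1}{p}, \quad \mathbb{E}X_1^2 = \phi''(1) + \phi'(1) = \frac{1 + q}{p^2} \quad \text{and} \quad \operatorname{Var} X_1 = \frac{q}{p^2}.
$$

So $\mathbb{E}X_n = n/p$, $\operatorname{Var} X_n = nq/p^2$.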


Worked Example 1.5.5 Each of the random variables U and V takes
the values ±1. Their joint distribution is given by

$$
\begin{aligned}
\mathbb{P}(U = 1) &= \mathbb{P}(U = -1) = \tfrac{1}{2},\\
\mathbb{P}(V = 1 \mid U = 1) &= \mathbb{P}(V = -1 \mid U = -1) = \tfrac{1}{3},\\
\mathbb{P}(V = -1 \mid U = 1) &= \mathbb{P}(V = 1 \mid U = -1) = \tfrac{2}{3}.
\end{aligned}
$$

(i) Find the probability that $x^2 + Ux + V = 0$ has at least one real root.

(ii) Find the expected value of the larger root of $x^2 + Ux + V = 0$ given
that there is at least one real root.

(iii) Find the probability that $x^2 + (U + V)x + U + V = 0$ has at least one
real root.


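Solution Write

$$
\mathbb{P}(U = 1, V = 1) = \mathbb{P}(U = -1, V = -1) = \tfrac{1}{6}, \quad
\mathbb{P}(U = 1, V = -1) = \mathbb{P}(U = -1, V = 1) = \tfrac{1}{3}.
$$

(i) $x^2 + Ux + V$ has a real root if and only if $U^2 - 4V \ge 0$, which means
V = −1. Clearly, if V = −1, then $U^2 - 4V = 5$. So the probability of a real
root is 1/2.

(ii) The expected value of the larger root is

$$
\begin{aligned}
&\frac{-1 + \sqrt{5}}{2}\, \mathbb{P}(U = 1 \mid V = -1) + \frac{1 + \sqrt{5}}{2}\, \mathbb{P}(U = -1 \mid V = -1)\\
&\quad = \frac{1}{\mathbb{P}(V = -1)} \left[\frac{-1 + \sqrt{5}}{2} \cdot \frac{1}{3} + \frac{1 + \sqrt{5}}{2} \cdot \frac{1}{6}\right]
= \frac{\sqrt{5}}{2} - \frac{1}{6}.
\end{aligned}
$$

(iii) $x^2 + Wx + W$ has a real root if $W^2 - 4W \ge 0$. If W = U + V, W
takes values 2, 0, −2 and the equation has a real root if W = 0 or −2. Then
$\mathbb{P}(W = 0) = 2/3$ and $\mathbb{P}(W = 0 \text{ or } -2) = 5/6$.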


Worked Example 1.5.6 In a sequence of independent trials, X is the
number of trials up to and including the ath success. Show that

$$
\mathbb{P}(X = r) = \binom{r - 1}{a - 1} p^a q^{r-a}, \quad r = a, a + 1, \ldots.
$$

Verify that the PGF for this distribution is $p^a s^a (1 - qs)^{-a}$. Show that $\mathbb{E}X =
a/p$ and $\operatorname{Var} X = aq/p^2$. Show how X can be represented as the sum of
a independent random variables, all with the same distribution. Use this
representation to derive again the mean and variance of X.


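Solution The probability $\mathbb{P}(X = r)$ is represented as

$$
\begin{aligned}
&\mathbb{P}(\{a - 1 \text{ successes occurred in } r - 1 \text{ trials}\} \cap \{\text{success on the } r\text{th trial}\})\\
&\quad = \binom{r - 1}{a - 1} p^{a-1} q^{r-1-a+1} p = \binom{r - 1}{a - 1} p^a q^{r-a}, \quad r \ge a.
\end{aligned}
$$

The PGF for this distribution is

$$
\phi(s) = \sum_{r \ge a} \binom{r - 1}{a - 1} p^a q^{r-a} s^r = p^a s^a \sum_{k \ge 0} \binom{k + a - 1}{a - 1} (qs)^k.
$$

Observe that $\binom{k + a - 1}{a - 1}$ coincides with the number of ways to represent
k as a sum of a non-negative integers

$$
k = k_1 + \cdots + k_a.
$$

This fact may be geometrically interpreted as follows. You take k 'stars' and
you intersperse them with a − 1 'bits':

    ∗∗ | ∗ | | ∗∗∗ | ⋯ | ∗

The number of stars between the (j − 1)th and jth bits gives you the value $k_j$;
$k_1$ is the number of stars to the left of the first bit and $k_a$ is the number of
stars to the right of the (a − 1)th bit. But the number of different diagrams
is $\binom{k + a - 1}{a - 1}$.
Now use the multinomial expansion formula

$$
\left(\sum_{k \ge 0} x^k\right)^a = \sum_{k \ge 0}\ \sum_{k_1, \ldots, k_a \ge 0:\ k_1 + \cdots + k_a = k}\ \prod_{j=1}^{a} x^{k_j}
$$

and obtain

$$
\sum_{k \ge 0} \binom{k + a - 1}{a - 1} (qs)^k = \left(\sum_{k \ge 0} (qs)^k\right)^a = (1 - qs)^{-a},
$$

and $\phi(s) = p^a s^a (1 - qs)^{-a}$. Further,

$$
\begin{aligned}
\mathbb{E}X &= \left.\left(p^a s^a (1 - qs)^{-a}\right)'\right|_{s=1} = \frac{a}{p},\\
\mathbb{E}(X(X - 1)) &= \left.\left(p^a s^a (1 - qs)^{-a}\right)''\right|_{s=1}
= a(a - 1) + \frac{2a^2(1 - p)}{p} + \frac{a(a + 1)(1 - p)^2}{p^2}.
\end{aligned}
$$

Hence,

$$
\begin{aligned}
\operatorname{Var} X &= \mathbb{E}X(X - 1) + \mathbb{E}X - (\mathbb{E}X)^2 = \phi''(1) + \phi'(1) - (\phi'(1))^2\\
&= \frac{a(a - 1)p^2 + 2a^2 p(1 - p) + a(a + 1)(1 - p)^2}{p^2} + \frac{a}{p} - \frac{a^2}{p^2} = \frac{aq}{p^2}.
\end{aligned}
$$

As $\phi(s) = (\psi(s))^a$, where $\psi(s) = ps(1 - qs)^{-1}$, you conclude that X may
be represented as a sum $\sum_{j=1}^{a} Y_j$, where $Y_1, \ldots, Y_a$ are IID RVs with the
PGF $\psi(s)$. In fact, $Y_j$ is the number of trials after the (j − 1)th success up to
and including the trial of the jth success. So,

$$
\mathbb{E}X = a\mathbb{E}Y_j, \quad \operatorname{Var} X = a \operatorname{Var} Y_j.
$$

As $Y_j \sim$ Geom (q), $\mathbb{E}Y_j = 1/p$ and $\operatorname{Var}(Y_j) = q/p^2$.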


Worked Example 1.5.7 What is a Poisson distribution? Suppose that
X and Y are independent Poisson variables with parameters $\lambda$ and $\mu$
respectively. Show that X + Y is Poisson with parameter $\lambda + \mu$. Are X and X + Y
independent? What is the conditional probability $\mathbb{P}(X = r \mid X + Y = n)$?


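Solution A Poisson distribution assigns probabilities

$$
p_r = e^{-\lambda} \frac{\lambda^r}{r!}
$$

to non-negative integers r = 0, 1, .... Here $\lambda > 0$ is the parameter of the
distribution. The PGF of a Poisson distribution is

$$
\phi_X(s) = \mathbb{E} s^X = \sum_{r \ge 0} \frac{\lambda^r}{r!} e^{-\lambda} s^r = e^{\lambda(s-1)}.
$$

By independence,

$$
\phi_{X+Y}(s) = \phi_X(s)\phi_Y(s) = e^{(\lambda + \mu)(s-1)},
$$

and by the uniqueness theorem, X + Y ~ Po($\lambda + \mu$). A similar conclusion
follows from a direct calculation:

$$
\begin{aligned}
\mathbb{P}(X + Y = r) &= \sum_{i,j:\ i+j=r} \mathbb{P}(X = i, Y = j) = \sum_{i,j:\ i+j=r} \mathbb{P}(X = i)\mathbb{P}(Y = j)\\
&= \sum_{i,j:\ i+j=r} \frac{e^{-\lambda}\lambda^i}{i!} \cdot \frac{e^{-\mu}\mu^j}{j!}
= \frac{e^{-(\lambda+\mu)}}{r!} \sum_{i,j:\ i+j=r} \binom{r}{i} \lambda^i \mu^j
= \frac{e^{-(\lambda+\mu)}}{r!} (\lambda + \mu)^r.
\end{aligned}
$$

However, X and X + Y are not independent, as $\mathbb{P}(X + Y = 2 \mid X = 4) = 0$.
In fact, for all r = 0, 1, ..., n:

$$
\begin{aligned}
\mathbb{P}(X = r \mid X + Y = n) &= \frac{\mathbb{P}(X = r, Y = n - r)}{\mathbb{P}(X + Y = n)}
= \frac{\mathbb{P}(X = r)\mathbb{P}(Y = n - r)}{\mathbb{P}(X + Y = n)}\\
&= \frac{e^{-\lambda - \mu} \lambda^r \mu^{n-r}\, n!}{e^{-\lambda - \mu} (\lambda + \mu)^n\, r!(n - r)!}
= \binom{n}{r} \left(\frac{\lambda}{\lambda + \mu}\right)^r \left(\frac{\mu}{\lambda + \mu}\right)^{n-r},
\end{aligned}
$$

i.e. $(X \mid X + Y = n) \sim \mathrm{Bin}(n, \lambda/(\lambda + \mu))$.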


Worked Example 1.5.8 Let X be a positive integer-valued RV. Define
its PGF $\phi_X$. Show that if X and Y are independent positive integer-valued
random variables, then

$$
\phi_{X+Y} = \phi_X \phi_Y.
$$

A non-standard pair of dice is a pair of six-sided unbiased dice whose
faces are numbered with strictly positive integers in a non-standard way
(for example, (2, 2, 2, 3, 5, 7) and (1, 1, 5, 6, 7, 8)). Show that there exists a
non-standard pair of dice A and B such that when thrown

$$
\mathbb{P}(\text{total shown by A and B is } n) = \mathbb{P}(\text{total shown by pair of ordinary dice is } n)
$$

for all $2 \le n \le 12$.
Hint: $x + x^2 + x^3 + x^4 + x^5 + x^6 = x(x + 1)(1 + x^2 + x^4) = x(1 + x + x^2)(1 + x^3)$.

Solution Use the PGF: for an RV V with finitely or countably many values
v: $\phi_V(x) = \mathbb{E} x^V = \sum_v x^v\, \mathbb{P}(V = v)$. The PGF determines the distribution
uniquely: if $\phi_U(x) = \phi_V(x)$, then U ~ V in the sense that $\mathbb{P}(U = v) =
\mathbb{P}(V = v)$. Also, if $V = V_1 + V_2$, where $V_1$ and $V_2$ are independent then
$\phi_V(x) = \phi_{V_1}(x)\phi_{V_2}(x)$.

Now let $S_2$ be the total score shown by a pair of standard dice, then

$$
\phi_{S_2}(x) = (\phi_S(x))^2 = \left(\frac{1}{6} \sum_{k=1}^{6} x^k\right)^2
= \frac{1}{36}\, x(1 + x)(1 + x^2 + x^4)\, x(1 + x + x^2)(1 + x^3),
$$

where S is the score shown by a single die.
Therefore, if we arrange a pair of dice A and B such that the score $T_A$ for
die A has the PGF

$$
\phi_A(x) = \frac{1}{6}\, x(1 + x^2 + x^4)(1 + x^3) = \frac{1}{6}\left(x + x^3 + x^4 + x^5 + x^6 + x^8\right)
$$

and the score $T_B$ for die B the PGF

$$
\phi_B(x) = \frac{1}{6}\, x(1 + x + x^2)(1 + x) = \frac{1}{6}\left(x + 2x^2 + 2x^3 + x^4\right)
$$

then the total score $T_A + T_B$ will have the same PGF as $S_2$. Hence, the die
A with faces 1, 3, 4, 5, 6 and 8 and die B with faces 1, 2, 2, 3, 3 and 4 will
satisfy the requirement.
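The claim about the non-standard pair can be confirmed by counting totals over all 36 ordered pairs of faces. The sketch below is a direct enumeration; the function name `total_distribution` is introduced here for illustration only.

```python
from collections import Counter
from itertools import product


def total_distribution(die_a, die_b):
    """Count how many of the 36 ordered face pairs give each total."""
    return Counter(a + b for a, b in product(die_a, die_b))


standard = [1, 2, 3, 4, 5, 6]
die_a = [1, 3, 4, 5, 6, 8]
die_b = [1, 2, 2, 3, 3, 4]

print(sorted(total_distribution(standard, standard).items()))
print(sorted(total_distribution(die_a, die_b).items()))
```

Since both dice in each pair are unbiased, equal counts for every total mean equal probabilities, which is the same statement as the equality of the two PGFs.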


Worked Example 1.5.9 A biased coin has probability p, 0 < p < 1, of
showing heads on a single throw. Show that the PGF of the number of heads
in n throws is

$$
\phi(s) = (ps + 1 - p)^n.
$$

Suppose the coin is thrown N times, where N is a random variable with
expectation $\mu$ and variance $\sigma^2$, and let Y be the number of heads obtained.
Show that the PGF $\phi_Y(s)$ of Y satisfies

$$
\phi_Y(s) = \phi_N(ps + 1 - p),
$$

where $\phi_N(s)$ is the PGF of N. Hence, or otherwise, find $\mathbb{E}Y$ and $\operatorname{Var} Y$.
Suppose $\mathbb{P}(N = k) = e^{-\lambda}\lambda^k/k!$, k = 0, 1, 2, .... Show that Y has a Poisson
distribution with parameter $\lambda p$.

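Solution Denote by X the number of heads after n throws. Then X ~
Bin (n, p):

$$
\mathbb{P}(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n,
$$

and $\phi_X(s) = \sum_{k} \binom{n}{k} (ps)^k (1 - p)^{n-k} = (ps + 1 - p)^n$. Next,

$$
\phi_Y(s) = \mathbb{E} s^Y = \mathbb{E}\left[\mathbb{E}(s^Y \mid N)\right] = \mathbb{E}(ps + 1 - p)^N = \phi_N(ps + 1 - p),
$$

where $\phi_N(s) = \mathbb{E} s^N$. Then

$$
\phi_Y'(s) = p\phi_N'(ps + 1 - p), \quad \text{so } \phi_Y'(1) = p\phi_N'(1), \quad \text{i.e. } \mathbb{E}Y = p\mu.
$$

Further,

$$
\phi_Y''(s) = p^2 \phi_N''(ps + 1 - p)
$$

and

$$
\operatorname{Var} Y = p^2 \phi_N''(1) + p\phi_N'(1) - p^2\mu^2 = p^2\sigma^2 + \mu p(1 - p).
$$

Thus, if N ~ Po ($\lambda$), with

$$
\phi_N(s) = \exp[\lambda(s - 1)],
$$

then

$$
\phi_Y(s) = \exp[\lambda(ps + 1 - p - 1)] = \exp[\lambda p(s - 1)],
$$

and Y ~ Po ($\lambda p$).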


Worked Example 1.5.10 If X and Y are independent Poisson RVs with
parameters $\lambda$ and $\mu$, show that X + Y is Poisson with parameter $\lambda + \mu$.

The proofs of my treatise upon the Binomial Theorem which I hope will
have a European vogue have come back from the printers. To my horror
I discover that the printers have introduced misprints in such a way that
the number of misprints on each page is a Poisson RV with parameter $\lambda$.
If I proofread the book once, then the probability that I will detect any
particular misprint is 1 − p independent of anything else. Show that after I
have proofread the book once the number of remaining misprints on each
page is a Poisson random variable and give its parameter.

My book has 256 pages, the number of misprints on each page is
independent of the numbers on other pages, the average number of misprints on
each page is 2 and p = 3/4. How many times must I proofread the book to
ensure that the probability that no misprint remains in the book is greater
than 1/2?

Solution You can use the PGFs:

$$
\phi_X(t) = \mathbb{E} t^X = \sum_{k \ge 0} t^k \frac{e^{-\lambda}\lambda^k}{k!} = e^{\lambda(t-1)},
$$

and similarly $\phi_Y(t) = e^{\mu(t-1)}$, with $\phi_{X+Y}(t) = \phi_X(t)\phi_Y(t) = e^{(\lambda + \mu)(t-1)}$
owing to the independence. Then, by the uniqueness, X + Y ~ Po($\lambda + \mu$).

Another way is by direct calculation: by the convolution formula,

$$
\begin{aligned}
\mathbb{P}(X + Y = r) &= \sum_{k=0}^{r} \mathbb{P}(X = k)\mathbb{P}(Y = r - k)
= \sum_{k=0}^{r} \frac{e^{-\lambda}\lambda^k}{k!} \cdot \frac{e^{-\mu}\mu^{r-k}}{(r - k)!}\\
&= \frac{e^{-(\lambda+\mu)}}{r!} \sum_{k=0}^{r} \binom{r}{k} \lambda^k \mu^{r-k}
= \frac{e^{-(\lambda+\mu)}}{r!} (\lambda + \mu)^r.
\end{aligned}
$$

If X is the number of original and Z of remaining misprints then Z ~
Po ($p\lambda$). In fact, the PGF $\phi_Z(t) = \mathbb{E} t^Z$ can be written as $\mathbb{E}[\mathbb{E}(t^Z \mid X)]$ and
equals

$$
\begin{aligned}
\sum_{k \ge 0} \frac{e^{-\lambda}\lambda^k}{k!}\, \mathbb{E}(t^Z \mid X = k)
&= \sum_{k \ge 0} \frac{e^{-\lambda}\lambda^k}{k!} \sum_{r=0}^{k} \frac{k!}{r!(k - r)!} (pt)^r (1 - p)^{k-r}\\
&= \sum_{k \ge 0} \frac{e^{-\lambda}\lambda^k}{k!} (pt + 1 - p)^k
= e^{\lambda(-1 + tp - p + 1)} = e^{p\lambda(t-1)}.
\end{aligned}
$$

This implies that Z ~ Po ($p\lambda$). A direct calculation also works:

$$
\mathbb{P}(Z = r) = \sum_{k \ge r} \frac{e^{-\lambda}\lambda^k}{k!} \cdot \frac{k!}{r!(k - r)!}\, p^r (1 - p)^{k-r}
= \frac{e^{-p\lambda}(p\lambda)^r}{r!} \sum_{k \ge r} \frac{e^{-(1-p)\lambda}[(1 - p)\lambda]^{k-r}}{(k - r)!}
= \frac{e^{-p\lambda}(p\lambda)^r}{r!}.
$$

Finally, after n proofreadings, each $Z_i \sim$ Po ($p^n\lambda$), $1 \le i \le 256$, with p = 3/4
and $\lambda = 2$. If $R = \sum_{i=1}^{256} Z_i$, then R ~ Po ($256 \times 2 \times (3/4)^n$), and

$$
\mathbb{P}(R = 0) = e^{-512(3/4)^n} \text{ should be } > \frac{1}{2}.
$$

Hence, we must have $512\,(3/4)^n < \ln 2$, i.e. $(3/4)^n < (\ln 2)/2^9$ or

$$
n > \frac{9\ln 2 - \ln\ln 2}{\ln(4/3)} \approx 22.96,
$$

so the book must be proofread at least 23 times.
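The condition on the number of proofreadings can also be checked by brute force, evaluating P(R = 0) = exp(-512 (3/4)^n) for successive n until it exceeds 1/2. The sketch below assumes the setting of the example (256 pages, on average 2 misprints per page, each misprint surviving a proofreading with probability p = 3/4); the function names are hypothetical.

```python
import math


def prob_no_misprints(n, pages=256, lam=2.0, p=0.75):
    """P(no misprint remains anywhere) after n independent proofreadings."""
    return math.exp(-pages * lam * p ** n)


def min_proofreadings(pages=256, lam=2.0, p=0.75, target=0.5):
    """Smallest n for which the probability of a clean book exceeds target."""
    n = 0
    while prob_no_misprints(n, pages, lam, p) <= target:
        n += 1
    return n


n_min = min_proofreadings()
print(n_min, prob_no_misprints(n_min - 1), prob_no_misprints(n_min))

# Closed-form threshold from 512 (3/4)^n < ln 2.
threshold = (math.log(512) - math.log(math.log(2))) / math.log(4 / 3)
print(threshold)
```

With the Poisson parameter 512 (3/4)^n, the loop stops at n = 23, in agreement with the closed-form threshold of about 22.96.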


Worked Example 1.5.11 State the precise relation between the binomial
and Poisson distributions. The lottery in Gambland sells on average
$10^7$ tickets. To win the first prize one must guess 7 numbers out of 50.
Assume that individual bets are independent and random (this is a rather
unrealistic assumption). For a given integer n > 0, write a formula for the
Poisson approximation of the probability that there are at least n first prize
winners in this week's draw. Give a rough estimate of this value for n = 5, 10.
The Stirling formula $n! \approx \sqrt{2\pi n}\, n^n e^{-n}$ can be used without proof.


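Solution The limit

$$
\lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k}{k!} e^{-\lambda}
$$

means that if $X_n \sim \mathrm{Bin}(n, \lambda/n)$ then $X_n \Rightarrow Y \sim \mathrm{Po}(\lambda)$ as $n \to \infty$. This
fact is used to approximate various binomial probabilities.
In the example

$$
n = 10^7, \quad p = 1\Big/\binom{50}{7}, \quad \text{and} \quad \lambda = 10^7\Big/\binom{50}{7}.
$$

Further,

$$
\binom{50}{7} \approx 10^8, \quad \lambda \approx 0.1, \quad e^{-\lambda} \approx 0.9.
$$

Then

$$
\mathbb{P}(\ge n \text{ winners}) = e^{-\lambda} \sum_{k \ge n} \frac{\lambda^k}{k!} \approx e^{-\lambda} \frac{\lambda^n}{n!}
$$

yields:

$$
\text{for } n = 5:\ \ 0.9 \times \frac{10^{-5}}{120} \approx 7.5 \times 10^{-8}, \qquad
\text{for } n = 10:\ \ 0.9 \times \frac{10^{-10}}{3 \cdot 10^6} \approx 3 \times 10^{-17}.
$$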
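The rough estimates can be compared with the Poisson tail computed without rounding the parameter. The sketch below takes the ticket count 10^7 and the 7-out-of-50 rule from the example; truncating the tail sum after 60 terms is an arbitrary choice that is harmless here because the terms decay very quickly.

```python
import math

tickets = 10 ** 7
combinations = math.comb(50, 7)  # number of ways to choose 7 numbers out of 50
lam = tickets / combinations     # Poisson parameter, close to 0.1


def poisson_tail(n, lam, terms=60):
    """Approximate P(X >= n) for X ~ Po(lam) by summing the first `terms` terms."""
    return math.exp(-lam) * sum(
        lam ** k / math.factorial(k) for k in range(n, n + terms)
    )


print(combinations, lam)
for n in (5, 10):
    print(n, poisson_tail(n, lam))
```

The computed values have the same orders of magnitude as the rough estimates, roughly 8 × 10^-8 for n = 5 and between 2 × 10^-17 and 3 × 10^-17 for n = 10.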
The first National Lottery in the modern sense was organised in France in
1757 by the Venetian Giacomo Casanova (1725-1798). He was an amateur 
mathematician and produced some advanced probabilistic calculations to 
persuade the French Ministry of Finance of the profitability of this enter- 
prise. Casanova believed that his memoire ‘Solution du probléme déliaque’ 
submitted for an academic prize in 1790 (where he claimed to solve the 
problem of doubling the volume of a cube) would bring him immortality. 
However, his mathematical works are completely forgotten nowadays. His 
name is mainly remembered due to his autobiography ‘Histoire de ma vie 
jusqu’a 1774’. 

J. Stirling (1692-1770) was a Scottish mathematician and engineer. Apart 
from his numerous mathematical achievements (and the Stirling formula is 
one of them), he was interested in such applied problems as the form of the 
Earth, in which his results were highly praised by his contemporaries. His 
life was not easy as he was an active Jacobite (a supporter of James II and 
his heirs, pretenders to the English throne). He had to flee abroad and spent 
some years in Venice where he continued his academic work. 

We conclude this section with a more challenging problem. 


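Worked Example 1.5.12 (i) Let $S_n = X_1 + \cdots + X_n$, $S_0 = 0$, and
$X_1, \ldots, X_n$ be IID with

$$
X_i = \begin{cases}
1, & \text{probability } p,\\
-1, & \text{probability } (1 - p),
\end{cases}
$$

where p > 1/2. Set $\tau_b = \min[n : S_n = b]$, b > 0, and $\tau_1 = \tau$. Prove that the
Laplace transform $L_\tau(\theta)$ (cf. (1.5.19)) has the form

$$
L_\tau(\theta) = \mathbb{E} e^{-\theta\tau} = \frac{1}{2(1 - p)} \left[e^{\theta} - \sqrt{e^{2\theta} - 4p(1 - p)}\right].
$$

Verify equations for the mean and the variance:

$$
\mathbb{E}\tau_b = \frac{b}{2p - 1}, \quad \operatorname{Var}(\tau_b) = \frac{4bp(1 - p)}{(2p - 1)^3}.
$$

(ii) Now assume that p < 1/2. Check that

$$
\mathbb{P}(\tau < \infty) = \frac{p}{1 - p}.
$$

Prove the following formulas:

$$
\mathbb{E}\left[e^{-\theta\tau} I(\tau < \infty)\right] = \frac{1}{2(1 - p)} \left(e^{\theta} - \sqrt{e^{2\theta} - 4p(1 - p)}\right),
$$

and

$$
\mathbb{E}\left[\tau I(\tau < \infty)\right] = \frac{p}{(1 - p)(1 - 2p)}.
$$

Here and below,

$$
I(\tau < \infty) = \begin{cases}
1, & \text{if } \tau < \infty,\\
0, & \text{if } \tau = \infty.
\end{cases}
$$

(iii) Compute $\operatorname{Var}[\tau I(\tau < \infty)]$.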


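Solution (i) We have that $\tau_b = \sum_{j=1}^{b} T_j$ where $T_j \sim \tau_1$, independently. Then

$$\mathbb{E}\tau_b = b\,\mathbb{E}\tau_1, \qquad \mathrm{Var}(\tau_b) = b\,\mathrm{Var}(\tau_1).$$

Next,

$$L_\tau(\theta) = pe^{-\theta} + (1-p)\,\mathbb{E}\left[e^{-\theta(1+\tau'+\tau'')}\right],$$

where $\tau'$ and $\tau''$ are independent with the same distribution as $\tau$. Hence,
$\mathbb{E}\left[e^{-\theta(\tau'+\tau'')}\right] = (\mathbb{E}e^{-\theta\tau})^2$ and $x = L_\tau(\theta)$ satisfies the quadratic equation:

$$x = pe^{-\theta} + (1-p)e^{-\theta}x^2.$$

Thus,

$$L_\tau(\theta) = \frac{1}{2(1-p)}\left[e^{\theta} - \sqrt{e^{2\theta} - 4p(1-p)}\right];$$

the minus sign is chosen since $L_\tau(\theta) \to 0$ when $\theta \to \infty$. Moreover, omitting
subscript $\tau$,

$$L'(\theta) = \frac{1}{2(1-p)}\left[e^{\theta} - \frac{e^{2\theta}}{\sqrt{e^{2\theta} - 4p(1-p)}}\right],$$

implying that

$$L'(0) = -\frac{1}{2p-1}.$$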


Alternatively, one can derive the equation $\mathbb{E}\tau = p + (1-p)(1 + 2\mathbb{E}\tau)$
for $\mathbb{E}\tau = -L'(0)$, which implies the same result.
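Using the formula $\mathrm{Var}(\tau) = L''(0) - (L'(0))^2$ we get

$$\mathrm{Var}(\tau) = \frac{1}{2(1-p)}\left[1 - \frac{2}{2p-1} + \frac{1}{(2p-1)^3}\right] - \frac{1}{(2p-1)^2}$$
$$= \frac{8p^3 - 20p^2 + 14p - 2 + 4p^2 - 6p + 2}{2(1-p)(2p-1)^3} = \frac{4p(1-p)}{(2p-1)^3}.$$

Alternatively, we can derive an equation for $y = \mathrm{Var}(\tau) = \mathbb{E}(\tau - \mathbb{E}\tau)^2$:

$$y = p(1 - \mathbb{E}\tau)^2 + (1-p)\,\mathbb{E}(1 + \tau' + \tau'' - \mathbb{E}\tau)^2,$$

where $\tau'$ and $\tau''$ are independent RVs with the same distribution as $\tau$. This
implies

$$y = p(1 - \mathbb{E}\tau)^2 + (1-p)\,\mathbb{E}\left[1 + \mathbb{E}\tau + (\tau' + \tau'' - 2\mathbb{E}\tau)\right]^2$$
$$= p(1 - \mathbb{E}\tau)^2 + (1-p)\left[\mathrm{Var}(\tau' + \tau'') + (1 + \mathbb{E}\tau)^2\right],$$

since $\mathbb{E}(\tau' + \tau'') = 2\mathbb{E}\tau$. Finally, observe that $\mathrm{Var}(\tau' + \tau'') = 2\mathrm{Var}(\tau)$ to get

$$y = p\left(1 - \frac{1}{2p-1}\right)^2 + (1-p)\left[2y + \left(1 + \frac{1}{2p-1}\right)^2\right],$$

i.e.

$$y = \frac{4p(1-p)}{(2p-1)^3}.$$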


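(ii) Observe that $z = \mathbb{P}(\tau < \infty)$ is the minimal solution of the equation
$z = p + (1-p)z^2$. The equation for $x = \mathbb{E}[e^{-\theta\tau}I(\tau < \infty)]$ takes the same
form as in (i):

$$x = pe^{-\theta} + (1-p)e^{-\theta}x^2.$$

However, the form of the solution is different:

$$\mathbb{E}\left[\tau I(\tau < \infty)\right] = \frac{1}{2(1-p)}\left[\frac{1}{1-2p} - 1\right] = \frac{p}{(1-p)(1-2p)},$$

as the square root is written differently: for $p < 1/2$, $\sqrt{1 - 4p(1-p)} = 1 - 2p$.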


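(iii) Differentiating $\mathbb{E}[e^{-\theta\tau} I(\tau < \infty)]$ twice with respect to $\theta$ we get that
$\mathrm{Var}[\tau I(\tau < \infty)]$ equals

$$\frac{1}{2(1-p)(1-2p)^3}\left[(1-2p)^3 - 2(1-2p)^2 + 1\right] - \left[\frac{p}{(1-p)(1-2p)}\right]^2 = \frac{p(1 - 4p^2 + 4p^3)}{(1-p)^2(1-2p)^3}.$$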
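The closed forms of Worked Example 1.5.12 can be checked numerically. The following is a minimal Python sketch, not part of the original solution: the helper names `laplace_tau`, `moments_at_zero` and `summary` are illustrative, and the derivatives at $\theta = 0$ are approximated by central finite differences rather than computed symbolically.

```python
import math

def laplace_tau(theta, p):
    """Closed form of E[exp(-theta*tau) I(tau < inf)] for up-probability p."""
    q = 1.0 - p
    return (math.exp(theta) - math.sqrt(math.exp(2.0 * theta) - 4.0 * p * q)) / (2.0 * q)

def moments_at_zero(p, h=1e-4):
    """Central finite differences at theta = 0; returns (-L'(0), L''(0))."""
    lo, mid, hi = laplace_tau(-h, p), laplace_tau(0.0, p), laplace_tau(h, p)
    return -(hi - lo) / (2.0 * h), (hi - 2.0 * mid + lo) / (h * h)

def summary(p):
    """Hitting probability, E[tau I(tau < inf)] and Var[tau I(tau < inf)]."""
    first, second = moments_at_zero(p)
    return {"hit_prob": laplace_tau(0.0, p), "mean": first, "var": second - first ** 2}

for p in (0.7, 0.3):
    print(p, summary(p))
```

For $p = 0.7$ the output should be close to $1/(2p-1) = 2.5$ and $4p(1-p)/(2p-1)^3 = 13.125$; for $p = 0.3$ the hitting probability should be close to $p/(1-p) = 3/7$.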
1.6 Chebyshev’s and Markov’s inequalities.
Jensen’s inequality.
The Law of Large Numbers and the
De Moivre–Laplace Theorem


Probabilists do it with large numbers. 
(From the series ‘How they do it’.) 


Chebyshev’s inequality is perhaps one of the most famous inequalities in the
whole of probability theory. It is named after the prominent Russian
mathematician P.L. Chebyshev (1821-1894). It states that if $X$ is a random
variable with finite expectation and variance then for all $\varepsilon > 0$:

$$\mathbb{P}(|X - \mathbb{E}X| \ge \varepsilon) \le \frac{1}{\varepsilon^2}\,\mathrm{Var}\,X. \qquad (1.6.1)$$


Chebyshev’s inequality gave rise to a number of generalisations. One is
Markov’s inequality (after Chebyshev’s pupil A.A. Markov (1856-1922),
another prominent Russian mathematician). Markov’s inequality is that for
any non-negative RV $Y$ with a finite expectation, for all $\varepsilon > 0$:

$$\mathbb{P}(Y \ge \varepsilon) \le \frac{1}{\varepsilon}\,\mathbb{E}Y. \qquad (1.6.2)$$

Chebyshev’s inequality is obtained from Markov’s by setting $Y = |X - \mathbb{E}X|^2$
and observing that the events $\{|X - \mathbb{E}X| \ge \varepsilon\}$ and $\{|X - \mathbb{E}X|^2 \ge \varepsilon^2\}$ are
the same.

The names of Chebyshev and Markov are associated with the rise of the 
Russian (more precisely, St Petersburg) school of probability theory. Neither 
of them could be described as having an ordinary personality. Chebyshev had 
wide interests in various branches of contemporary science (and also in the 
political, economic and social life of the period). This included the study
of ballistics in response to demands by his brother who was a distinguished 
artillery general in the Russian Imperial Army. Markov was a well-known 
liberal opposed to the tsarist regime: in 1913, when Russia celebrated the 
300th anniversary of the Imperial House of Romanov, he and some of his 
colleagues defiantly organised a celebration of the 200th anniversary of the 
Law of Large Numbers (LLN; see below). In vol. II of this book series we
speak at length about the life and work of Markov in connection with the
so-called Markov chains, a fundamental concept of modern theoretical and
applied probability theory.
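We will now prove Markov’s inequality. This is quite straightforward: if
the values are $x_1, x_2, \ldots$ and taken with probabilities $p_1, p_2, \ldots$, then

$$\mathbb{E}X = \sum_j x_j p_j \ge \sum_{j:\, x_j \ge \varepsilon} x_j p_j \ge \varepsilon \sum_{j:\, x_j \ge \varepsilon} p_j = \varepsilon\,\mathbb{P}(X \ge \varepsilon).$$

This can be made shorter using the indicator $I(X \ge \varepsilon)$:

$$\mathbb{E}X \ge \mathbb{E}[X I(X \ge \varepsilon)] \ge \varepsilon\,\mathbb{E}I(X \ge \varepsilon) = \varepsilon\,\mathbb{P}(X \ge \varepsilon). \qquad (1.6.3)$$

Here we also see how the argument develops: first, the inequality $X \ge X I(X \ge \varepsilon)$
holds because $X \ge 0$ (and of course $1 \ge I(X \ge \varepsilon)$). This implies
that $\mathbb{E}X \ge \mathbb{E}[X I(X \ge \varepsilon)]$. Similarly, the inequality $X I(X \ge \varepsilon) \ge \varepsilon I(X \ge \varepsilon)$
implies that $\mathbb{E}[X I(X \ge \varepsilon)] \ge \mathbb{E}[\varepsilon I(X \ge \varepsilon)]$. The latter is equal to $\varepsilon\,\mathbb{E}I(X \ge \varepsilon)$
and finally, to $\varepsilon\,\mathbb{P}(X \ge \varepsilon)$.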
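Both inequalities can be verified exactly on a small discrete distribution. The sketch below is an illustration only: it uses a fair die as the test distribution, and the helper names `expectation`, `tail`, `markov_holds` and `chebyshev_holds` are assumptions introduced here. Exact rational arithmetic avoids rounding issues at the boundary of the events.

```python
from fractions import Fraction

def expectation(dist, f=lambda x: x):
    """E f(X) for a finite distribution given as {value: probability}."""
    return sum(p * f(x) for x, p in dist.items())

def tail(dist, event):
    """Probability of the set of values satisfying `event`."""
    return sum(p for x, p in dist.items() if event(x))

def markov_holds(dist, eps):
    """Check P(X >= eps) <= EX / eps for a non-negative X."""
    return tail(dist, lambda x: x >= eps) <= expectation(dist) / eps

def chebyshev_holds(dist, eps):
    """Check P(|X - EX| >= eps) <= Var X / eps^2."""
    m = expectation(dist)
    v = expectation(dist, lambda x: (x - m) ** 2)
    return tail(dist, lambda x: abs(x - m) >= eps) <= v / eps ** 2

die = {k: Fraction(1, 6) for k in range(1, 7)}
mean = expectation(die)
var = expectation(die, lambda x: (x - mean) ** 2)
print(mean, var, all(markov_holds(die, Fraction(e, 2)) for e in range(1, 13)))
```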

It has to be noted that in Chebyshev’s and Markov’s inequalities $\varepsilon > 0$
does not have to be small or large: the inequality holds for any positive value.

In general, if $g: \mathbb{R} \to \mathbb{R}$ is a non-negative monotonic non-decreasing function and $X$ a
real-valued RV then, for all $x \in \mathbb{R}$ with $g(x) > 0$ and a finite mean $\mathbb{E}g(X)$,

$$\mathbb{P}(X \ge x) \le \frac{1}{g(x)}\,\mathbb{E}g(X); \qquad (1.6.4)$$

a popular case is where $g(x) = e^{ax}$ with $a > 0$:

$$\mathbb{P}(X \ge x) \le \frac{1}{e^{ax}}\,\mathbb{E}e^{aX} \qquad (1.6.5)$$

(Chernoff’s inequality).

The domain of applications of these inequalities is huge (and not restricted 
to probability theory); we will discuss one of them here: the LLN. 

Another example of a powerful inequality used in more than one area
of mathematics is Jensen’s inequality. It is named after J.L. Jensen
(1859-1925), a Danish analyst who used it in his 1906 paper. Actually, the
inequality was discovered in 1889 by O. Hölder, a German analyst, but for
some reason is not named after him (maybe because there already was a
Hölder inequality proved to be extremely important in analysis and
differential equations). Let $X$ be an RV with values in an (open, half-open or
closed) interval $J \subseteq \mathbb{R}$ (possibly unbounded, i.e. coinciding with a half-line
or the whole line), with a finite expectation $\mathbb{E}X$, and $g: J \to \mathbb{R}$ a convex
(concave) real-valued function such that the expectation $\mathbb{E}g(X)$ is finite.
Jensen’s inequality asserts that

$$\mathbb{E}g(X) \ge g(\mathbb{E}X), \quad \text{respectively,} \quad \mathbb{E}g(X) \le g(\mathbb{E}X). \qquad (1.6.6)$$
In other words, for all $x_1, \ldots, x_n \in (a,b)$ and probabilities $p_1, \ldots, p_n$ (with
$p_1, \ldots, p_n \ge 0$ and $p_1 + \cdots + p_n = 1$):

$$g\Big(\sum_{j=1}^{n} p_j x_j\Big) \le \sum_{j=1}^{n} p_j g(x_j), \quad \text{respectively,} \quad g\Big(\sum_{j=1}^{n} p_j x_j\Big) \ge \sum_{j=1}^{n} p_j g(x_j). \qquad (1.6.7)$$

Here we adopt the following definition: a function $g$ on $[a,b]$ is convex if
for all $x, y \in [a,b]$ and $\lambda \in (0,1)$,

$$g(\lambda x + (1-\lambda)y) \le \lambda g(x) + (1-\lambda)g(y), \qquad (1.6.8)$$

or, in other words, for a convex function $g: J \to \mathbb{R}$ defined on an interval
$J \subseteq \mathbb{R}$, for all $x_1, x_2 \in J$ and $p_1, p_2 \in [0,1]$ with $p_1 + p_2 = 1$,

$$g(p_1 x_1 + p_2 x_2) \le p_1 g(x_1) + p_2 g(x_2). \qquad (1.6.9)$$

See Figure 1.6.

To prove inequality (1.6.7), it is natural to use induction in $n$. For $n = 2$
bound (1.6.7) is merely bound (1.6.9). The induction step from $n$ to $n+1$
is as follows. Write

$$g\Big(\sum_{i=1}^{n+1} p_i x_i\Big) \le p_{n+1}\, g(x_{n+1}) + (1 - p_{n+1})\, g\Big(\sum_{i=1}^{n} \frac{p_i}{1 - p_{n+1}}\, x_i\Big)$$

and use the induction hypothesis for probabilities $p_i/(1 - p_{n+1})$, $1 \le i \le n$.
This yields the bound

$$p_{n+1}\, g(x_{n+1}) + (1 - p_{n+1})\, g\Big(\sum_{i=1}^{n} \frac{p_i}{1 - p_{n+1}}\, x_i\Big) \le p_{n+1}\, g(x_{n+1}) + \sum_{i=1}^{n} p_i g(x_i) = \sum_{i=1}^{n+1} p_i g(x_i).$$


If $X$ takes infinitely many values, then a further analytic argument is
required which we will not perform here. (One would need to use the fact
that a convex (resp. concave) function $g$ is always continuous in the interior
of the interval $J$ where it has been defined; $g$ may be not differentiable, but only at a (at
most) countable set of points $x \in J$, and at each point $x \in J$ where $g$ is
twice-differentiable, $-g''(x) \le 0$ (resp. $-g''(x) \ge 0$).)

[Figure 1.6: graph of a convex function $y = g(x)$.]

Jensen’s inequality can be strengthened, by characterising the cases of
equality. Call a convex (concave) function $g$ strictly convex (concave) if
equality in (1.6.7) is achieved if and only if either $x_1 = \cdots = x_n$ or all
$p_j$ except for one are equal to zero (and the remaining one is equal to one). For such
a function $g$, equality in (1.6.6) is achieved if and only if the RV $X$ is constant.

An immediate corollary of Jensen’s inequality, with $g(x) = x^s$, $x \in [0, \infty)$,
is that for all $s \ge 1$: $(\mathbb{E}X)^s \le \mathbb{E}X^s$, for all RVs $X \ge 0$ with finite expected
value $\mathbb{E}X^s$. For $0 < s < 1$, the inequality is reversed: $(\mathbb{E}X)^s \ge \mathbb{E}X^s$. Another
corollary, with $g(x) = \ln x$, $x \in (0, \infty)$, is that $\sum_j x_j p_j \ge \prod_j x_j^{p_j}$ for any
positive $x_1, \ldots, x_n$ and probabilities $p_1, \ldots, p_n$. For $p_1 = \cdots = p_n$, this
becomes the famous arithmetic-geometric mean inequality:

$$\frac{1}{n}(x_1 + \cdots + x_n) \ge (x_1 \cdots x_n)^{1/n}. \qquad (1.6.10)$$
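The finite form (1.6.7) of Jensen’s inequality and its arithmetic-geometric mean corollary are easy to test numerically. The following Python sketch is illustrative only; the function name `jensen_gap` and the sample values `xs`, `ps` are assumptions made for the example.

```python
import math

def jensen_gap(g, xs, ps):
    """Return sum_j p_j g(x_j) - g(sum_j p_j x_j): >= 0 for convex g, <= 0 for concave g."""
    mean = sum(p * x for x, p in zip(xs, ps))
    return sum(p * g(x) for x, p in zip(xs, ps)) - g(mean)

xs = [0.5, 1.0, 2.0, 8.0]
ps = [0.1, 0.2, 0.3, 0.4]

gap_square = jensen_gap(lambda x: x ** 2, xs, ps)  # convex g(x) = x^2
gap_log = jensen_gap(math.log, xs, ps)             # concave g(x) = ln x

arithmetic = sum(xs) / len(xs)
geometric = math.prod(xs) ** (1 / len(xs))
print(gap_square, gap_log, arithmetic, geometric)
```

For $g(x) = x^2$ the gap is exactly the variance of the distribution, which explains why it is strictly positive unless the distribution is concentrated at one point.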

We now turn to the LLN. In its weak form the statement (and its proof) 
is simple. 

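Let $X_1, \ldots, X_n$ be IID RVs with the (finite) mean value $\mathbb{E}X_j = \mu$ and
variance $\mathrm{Var}\,X_j = \sigma^2$. Set

$$S_n = X_1 + \cdots + X_n.$$

Then, for all $\varepsilon > 0$,

$$\mathbb{P}\left(\left|\frac{S_n}{n} - \mu\right| \ge \varepsilon\right) \to 0 \quad \text{as } n \to \infty. \qquad (1.6.11)$$

In words, the averaged sum $S_n/n$ of IID RVs $X_1, \ldots, X_n$ with mean $\mathbb{E}X_j = \mu$
and $\mathrm{Var}\,X_j = \sigma^2$ converges in probability to $\mu$.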


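The proof uses Chebyshev’s inequality:

$$\mathbb{P}\left(\left|\frac{S_n}{n} - \mu\right| \ge \varepsilon\right) \le \frac{1}{\varepsilon^2}\,\mathrm{Var}\left(\frac{S_n}{n}\right) = \frac{1}{\varepsilon^2 n^2}\sum_{j=1}^{n}\mathrm{Var}\,X_j = \frac{\sigma^2}{n\varepsilon^2},$$

which vanishes as $n \to \infty$.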
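To see how conservative the Chebyshev bound in this proof is, one can compare it with the exact deviation probability for coin-tossing. The sketch below is a hedged illustration: the helpers `binom_pmf`, `deviation_prob` and `chebyshev_bound` are names introduced here, the binomial probabilities are evaluated through `math.lgamma`, and a small tolerance guards the boundary comparison against floating-point rounding.

```python
import math

def binom_pmf(n, m, p):
    """P(S_n = m) for S_n ~ Bin(n, p), computed in log scale to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
               + m * math.log(p) + (n - m) * math.log(1 - p))
    return math.exp(log_pmf)

def deviation_prob(n, p, eps):
    """Exact P(|S_n/n - p| >= eps); the 1e-9 slack absorbs rounding at the boundary."""
    return sum(binom_pmf(n, m, p) for m in range(n + 1)
               if abs(m - n * p) >= n * eps - 1e-9)

def chebyshev_bound(n, p, eps):
    """The bound sigma^2 / (n eps^2) with sigma^2 = p(1 - p)."""
    return p * (1 - p) / (n * eps ** 2)

for n in (10, 100, 1000):
    print(n, deviation_prob(n, 0.5, 0.1), chebyshev_bound(n, 0.5, 0.1))
```

Both columns tend to zero, but the exact probability decays much faster than the $1/n$ rate guaranteed by Chebyshev’s inequality.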
It is instructive to observe that the proof goes through when we simply
assume that $X_1, X_2, \ldots$ are such that $\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = o(n^2)$. For indicator
IID RVs $X_i$ with

$$X_i = \begin{cases} 1, & \text{probability } p,\\ 0, & \text{probability } 1-p\end{cases} \quad \text{(heads and tails in a biased coin-tossing)}$$

and $\mathbb{E}X_i = p$, the LLN says that after a large number $n$ of trials, the proportion
of heads will be close to $p$, the probability of a head at a single toss.

The LLN for IID variables X; taking values 0 and 1 was known to 
seventeenth- and eighteenth-century mathematicians, notably J. Bernoulli 
(1654-1705). He was a member of the extraordinary Bernoulli family of 
mathematicians and natural scientists of Flemish origin. Several generations
of this family resided in Prussia, Russia, Switzerland and other countries and
dominated the scientific development on the European continent.

It turns out that the assumption that the variance $\sigma^2$ is finite is not
necessary in the LLN. There are also several forms of convergence which emerge
in connection with the LLN; some of them will be discussed in forthcoming
chapters. Here we only mention the strong form of the LLN:

For IID RVs $X_1, X_2, \ldots$ with finite mean $\mu$:

$$\mathbb{P}\left(\lim_{n \to \infty} \frac{S_n}{n} = \mu\right) = 1.$$


The next step in studying the sum $S_n$ is the Central Limit Theorem
(CLT). Consider IID RVs $X_1, X_2, \ldots$ with finite mean $\mathbb{E}X_j = a$ and variance
$\mathrm{Var}\,X_j = \sigma^2$ and a finite higher moment $\mathbb{E}|X_j - \mathbb{E}X_j|^{2+\delta} = \rho$, for some fixed
$\delta > 0$. The integral CLT asserts that the following convergence holds: for all
$x \in \mathbb{R}$,

$$\lim_{n \to \infty} \mathbb{P}\left(\frac{S_n - na}{\sigma\sqrt{n}} < x\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-y^2/2}\,dy. \qquad (1.6.12)$$

The map

$$\Phi: x \in \mathbb{R} \mapsto \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-y^2/2}\,dy \qquad (1.6.13)$$

defines the so-called standard normal, or N(0,1), cumulative distribution
function $\Phi(x)$, an object of paramount importance in probability theory
and statistics. It is also called a Gaussian distribution function, named after
Karl-Friedrich Gauss (1777-1855), the famous German mathematician,
astronomer and physicist, who made a profound impact on a number of areas
of mathematics. He identified the distribution while working on the theory
of errors in astronomical observations. The Gaussian distribution fitted the
pattern of errors much better than the ‘double-exponential’ distribution previously
used by Laplace.

The values of $\Phi(x)$ (or $\overline{\Phi}(x) = 1 - \Phi(x)$) have been calculated with great
accuracy for a fine mesh of values of $x$ and constitute a major part of
the probabilistic and statistical tables. See Table 1.1 below. This table
specifies $1 - \Phi(x)$ for $0 \le x < 3$ with step 0.05.


God does arithmetic.
K.-F. Gauss (1777-1855), German mathematician and physicist.


Gauss (some sources give his first names as Johann Karl Friedrich) was
perhaps the last encyclopaedist in science who was able to make fundamental
discoveries in disciplines beyond mathematics. For example, his achievements
in studying magnetism were acknowledged in naming a unit of measurement
of a magnetic field after him. A human brain generates a field of
$10^{-9}$-$10^{-8}$ gauss; the Earth’s magnetic field is between 0.3 and 0.6 gauss
on the surface and about 25 gauss in the core; a refrigerator sticker magnet
generates 50 gauss; a medical magnetic resonance imaging device is between
15 000 and 30 000 gauss (do not take your credit cards with you into this
machine, as it may lead to their deactivation, let alone a wrong diagnosis); a
neutron star produces $10^{12}$-$10^{13}$ gauss, with some newly discovered types
(magnetars) reaching $10^{15}$ gauss. The strongest predicted magnetic field in
the Universe would be in the region of $10^{17}$ gauss. Gauss himself was
conscientious both about rigour in mathematics and the place of mathematics
among other sciences. This led, on the one hand, to numerous discoveries
sitting on his desk and waiting for years for their irrefutable proofs and,
on the other hand, to his declining offers of stipends, in favour of earnings
through practical works. The name of Gauss will often appear throughout
this book.

        Values of 1 − Φ(x)            Values of 1 − Φ(x)
  x     +0.00    +0.05          x     +0.00    +0.05
 0.0    0.5000   0.4801        1.5    0.0668   0.0606
 0.1    0.4602   0.4404        1.6    0.0548   0.0495
 0.2    0.4207   0.4013        1.7    0.0446   0.0401
 0.3    0.3821   0.3632        1.8    0.0359   0.0322
 0.4    0.3446   0.3264        1.9    0.0288   0.0256
 0.5    0.3085   0.2912        2.0    0.0228   0.0202
 0.6    0.2743   0.2578        2.1    0.0179   0.0158
 0.7    0.2420   0.2266        2.2    0.0139   0.0122
 0.8    0.2119   0.1977        2.3    0.0107   0.0094
 0.9    0.1841   0.1711        2.4    0.0082   0.0071
 1.0    0.1587   0.1469        2.5    0.0062   0.0054
 1.1    0.1357   0.1251        2.6    0.0047   0.0040
 1.2    0.1151   0.1056        2.7    0.0035   0.0030
 1.3    0.0968   0.0885        2.8    0.0026   0.0022
 1.4    0.0808   0.0735        2.9    0.0019   0.0016

Table 1.1
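Entries of a table such as Table 1.1 can be regenerated with a few lines of Python. The sketch below assumes only the standard identity $1 - \Phi(x) = \tfrac12\,\mathrm{erfc}(x/\sqrt{2})$; the function names `upper_tail` and `tail_table` are illustrative.

```python
import math

def upper_tail(x):
    """1 - Phi(x) for the standard normal law, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_table(start=0.0, stop=3.0, step=0.05):
    """Map each grid point x in [start, stop) to 1 - Phi(x), rounded to four decimals."""
    count = round((stop - start) / step)
    return {round(start + k * step, 2): round(upper_tail(start + k * step), 4)
            for k in range(count)}

table = tail_table()
for x in (0.0, 1.0, 1.95, 2.2):
    print(x, table[x])
```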
At the present stage it is useful to memorise five facts about $\Phi(x)$.

(i)
$$\lim_{x \to -\infty} \Phi(x) = 0, \qquad \lim_{x \to \infty} \Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-y^2/2}\,dy = 1$$
(these are standard properties of a distribution function; see below).

(ii) $\Phi(x) = 1 - \Phi(-x)$ for all $x \in \mathbb{R}$, implying that
$$\Phi(0) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0} e^{-y^2/2}\,dy = \frac{1}{2}$$
(which means that the median of the standard Gaussian distribution
is 0 (see below)). This follows from the previous property and the fact
that the integrand $e^{-y^2/2}$ is an even function.

(iii)
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y\,e^{-y^2/2}\,dy = 0$$
(which means that the mean value of the standard Gaussian distribution
is 0). This follows from the above observation as $y e^{-y^2/2}$ is an odd
function.

(iv)
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y^2 e^{-y^2/2}\,dy = 1$$
(which means that the variance of the standard Gaussian distribution is
1). This can be deduced by integration by parts; see below.

(v) The function $x \mapsto \Phi(x)$ is strictly increasing with $x$ and continuous
in $x$. Hence, the inverse function $\Phi^{-1}$ is correctly defined, taking $y \in (0,1)$
to $x \in \mathbb{R}$ such that $\Phi(x) = y$. The inverse function plays an
important role in statistics.
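The proof of the claim, in (i), that $\Phi(\infty) = 1$, i.e.

$$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi},$$

is quite elegant. For brevity, write $\int_{\mathbb{R}}$ for the integral $\int_{-\infty}^{\infty}$ over the whole
line $\mathbb{R}$. If $\int_{\mathbb{R}} e^{-x^2/2}\,dx = G$, then

$$G^2 = \int_{\mathbb{R}} e^{-x^2/2}\,dx \int_{\mathbb{R}} e^{-y^2/2}\,dy = \int_{\mathbb{R}^2} e^{-(x^2+y^2)/2}\,dy\,dx,$$

which in polar co-ordinates equals

$$\int_0^{\infty} r e^{-r^2/2}\int_0^{2\pi} d\theta\,dr = 2\pi\int_0^{\infty} r e^{-r^2/2}\,dr = 2\pi\int_0^{\infty} e^{-u}\,du = 2\pi.$$

Hence, $G = \sqrt{2\pi}$. Furthermore, in (iv):

$$\int_{\mathbb{R}} y^2 e^{-y^2/2}\,dy = \left(-y e^{-y^2/2}\right)\Big|_{-\infty}^{\infty} + \int_{\mathbb{R}} e^{-y^2/2}\,dy = \sqrt{2\pi}.$$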


There also exists a local CLT that deals with a detailed analysis of
probabilities

$$\mathbb{P}\left(\frac{S_n - na}{\sigma\sqrt{n}} = x_n\right) = \mathbb{P}\left(S_n = na + x_n\sigma\sqrt{n}\right)$$

for a suitable range of values $x_n \in \mathbb{R}$. The aim here is to produce asymptotic
expressions for these probabilities of the form $e^{-x_n^2/2}\big/\left(\sqrt{2\pi n}\,\sigma\right)$ which
take into account only the mean value $a$ and the variance $\sigma^2$.

The modern method of proving the CLT is based on the fact that the
convergence in (1.6.12) is equivalent to the convergence of the characteristic
functions $\psi_{(S_n - na)/(\sigma\sqrt{n})}(t)$ to the characteristic function of the Gaussian
distribution. The basics of this method will be discussed later.

In the rest of this section we will focus on the CLT for the above example
of coin-tossing where

$$X_i = \begin{cases} 1, & \text{probability } p,\\ 0, & \text{probability } 1-p,\end{cases} \quad \text{independently.}$$
Again, for this example, the CLT goes back to the eighteenth century and 
is often called the De Moivre-Laplace Theorem (DMLT), after two French 
mathematicians. One of them, A. De Moivre (1667-1754), fled to England 
after the Revocation of the Edict of Nantes, and had a notable career in 
teaching and research. The other, P.-S. Laplace (1749-1827), made signifi- 
cant contributions in several areas of mathematics. He also served (briefly 
and not very successfully) as the Interior Minister under Napoleon (Laplace 
examined the young Napoleon in mathematics in the French Royal Artillery
Corps and gave the promising officer the highest mark). Despite his association
with Napoleon, Laplace was the first to vote to oust Napoleon from
power in 1814 in the French Senate; the returning Bourbons promoted him 
from the title of Count to Marquis. 

An initial form of the theorem was produced by De Moivre in 1733. He 
was also the first to identify the normal distribution (which was named after 
Gauss almost a hundred years later). 

Informally, DMLT asserts that if $X_1, X_2, \ldots$ is an IID sequence, with
$\mathbb{P}(X_i = 1) = p$, $\mathbb{P}(X_i = 0) = 1 - p$ and hence $\mathbb{E}X_i = p$, $\mathrm{Var}\,X_i = p(1-p)$,
then the RV $(S_n - np)/\sqrt{np(1-p)}$ has, approximately, the standard N(0,1)
distribution. At the formal level, one again distinguishes a local and an
integral DMLT. For a given positive integer $m \le n$, write

$$\mathbb{P}(S_n = m) = \mathbb{P}\left(\frac{S_n - np}{\sqrt{np(1-p)}} = z_n(m)\right), \quad \text{where } z_n(m) = \frac{m - np}{\sqrt{np(1-p)}}.$$

The formal statement of the local DMLT is:
As $n, m \to \infty$, the ratio

$$\mathbb{P}(S_n = m)\Big/\left\{\frac{1}{\sqrt{2\pi np(1-p)}}\exp\left[-\frac{1}{2}z_n(m)^2\right]\right\} \to 1 \qquad (1.6.14)$$

as long as $m - np = o(n^{2/3})$. More precisely, (1.6.14) holds uniformly in $n$, $m$
for which the expression $(m - np)n^{-2/3}$ is confined to a bounded interval and
tends to 0.

Recall that $\mathbb{P}(S_n = m)$ is the probability that an event of probability $p$ occurs
$m$ times in $n$ independent trials.
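The ratio in (1.6.14) can be inspected numerically. The following Python sketch is an illustration under stated assumptions: `log_binom_pmf` and `local_ratio` are names introduced here, and the binomial probability is evaluated through `math.lgamma` so that large $n$ causes no overflow.

```python
import math

def log_binom_pmf(n, m, p):
    """log P(S_n = m) for S_n ~ Bin(n, p)."""
    return (math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
            + m * math.log(p) + (n - m) * math.log(1 - p))

def local_ratio(n, m, p):
    """P(S_n = m) divided by the Gaussian expression of the local DMLT."""
    var = n * p * (1 - p)
    z = (m - n * p) / math.sqrt(var)
    log_gauss = -0.5 * z * z - 0.5 * math.log(2 * math.pi * var)
    return math.exp(log_binom_pmf(n, m, p) - log_gauss)

n, p = 10_000, 0.3
for m in (3000, 3050, 3200, 4000):
    print(m, (m - n * p) / n ** (2 / 3), local_ratio(n, m, p))
```

The ratio stays close to 1 while $(m - np)n^{-2/3}$ is small and drifts away once $m - np$ becomes comparable with $n^{2/3}$ or larger, which is the regime excluded by the theorem.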
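As before, the integral DMLT deals with cumulative probabilities and
states that for all $x \in \mathbb{R}$:

$$\lim_{n \to \infty} \mathbb{P}\left(\frac{S_n - np}{\sqrt{np(1-p)}} < x\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-y^2/2}\,dy. \qquad (1.6.15)$$

Equivalently, for all $-\infty \le a < b \le \infty$:

$$\lim_{n \to \infty} \mathbb{P}\left(a < \frac{S_n - np}{\sqrt{np(1-p)}} < b\right) = \frac{1}{\sqrt{2\pi}}\int_{a}^{b} e^{-y^2/2}\,dy. \qquad (1.6.16)$$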

Although the integral DMLT looks more amenable, its proof is longer
than that of the local DMLT (and uses it as a part). Neither of these theorems
is formally proved in Cambridge IA Probability, although they are widely
employed in subsequent courses (particularly, IB Statistics). The proof
below is given here for completeness. It is not used in the problems from
this volume but gives a useful insight into how the normal distribution arises
as an asymptotic distribution for (properly normalised) sums of
independent RVs.

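The local DMLT can be proved by a direct argument. The main step is
the fact that for any sequence of positive integers $\{m_n\}$ such that $m_n \le n$
and $m_n,\ n - m_n \to \infty$,

$$\mathbb{P}(S_n = m_n)\Big/\left\{\left[2\pi n\,\frac{m_n}{n}\left(1 - \frac{m_n}{n}\right)\right]^{-1/2}\exp\left[-n\,h\left(\frac{m_n}{n}, p\right)\right]\right\} \to 1. \qquad (1.6.17)$$

Here,

$$h\left(\frac{m_n}{n}, p\right) = \frac{m_n}{n}\ln\frac{m_n}{np} + \left(1 - \frac{m_n}{n}\right)\ln\frac{n - m_n}{n(1-p)}, \qquad (1.6.18)$$

which is a particular case of a more general definition: for $p^* \in (0,1)$:

$$h(p^*, p) = p^*\ln\frac{p^*}{p} + (1 - p^*)\ln\frac{1 - p^*}{1 - p}. \qquad (1.6.19)$$

Convergence (1.6.14) then follows for $m_n - np = o(n^{2/3})$ as for $p^*$ close to
$p$, the Taylor expansion yields:

$$h(p^*, p) = \frac{1}{2}\left(\frac{1}{p} + \frac{1}{1-p}\right)(p^* - p)^2 + O(|p^* - p|^3), \qquad (1.6.20)$$

as

$$h(p^*, p)\Big|_{p^* = p} = \left(\frac{d}{dp^*}h(p^*, p)\right)\Big|_{p^* = p} = 0.$$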

Remark 1.6.1 Equation (1.6.17) and the particular form $-nh(p^*, p)$ of
the exponent, with $p^* = m_n/n$ and $h(p^*, p)$ given by formula (1.6.19), is
important for the asymptotic analysis of various probabilities related to
sums of independent RVs. In particular, formulas (1.6.18)-(1.6.20) play a
significant role in information theory and the theory of large deviations. The
function $h(p^*, p)$ is called the relative entropy of the probability distribution
$\{p^*, 1 - p^*\}$ with respect to the probability distribution $\{p, 1 - p\}$.


The proof of equation (1.6.17) is straightforward and uses the Stirling
formula:

$$n! \approx \sqrt{2\pi n}\,n^n e^{-n}. \qquad (1.6.21)$$

Note that this formula admits a more precise form:

$$n! = \sqrt{2\pi n}\,n^n e^{-n + \theta(n)}, \quad \text{where } \frac{1}{12n+1} < \theta(n) < \frac{1}{12n}. \qquad (1.6.22)$$

However, for our purposes formula (1.6.21) is enough.
Omitting subscript $n$ in $m_n$, the probability $\mathbb{P}(S_n = m)$ equals

$$\binom{n}{m}p^m(1-p)^{n-m} \approx \left(\frac{n}{2\pi m(n-m)}\right)^{1/2}\frac{n^n}{m^m(n-m)^{n-m}}\,p^m(1-p)^{n-m}$$
$$= \left[2\pi n\,\frac{m}{n}\left(1 - \frac{m}{n}\right)\right]^{-1/2}\exp\left[-m\ln\frac{m}{n} - (n-m)\ln\frac{n-m}{n} + m\ln p + (n-m)\ln(1-p)\right].$$

But the RHS of the last equation coincides with the denominator in
formula (1.6.17).
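The relative-entropy form (1.6.17) remains accurate far from the mean, where the Gaussian form (1.6.14) fails, and the Taylor expansion (1.6.20) links the two. The Python sketch below illustrates this; the names `rel_entropy`, `log_binom_pmf` and `log_entropy_form` are assumptions introduced for the example.

```python
import math

def rel_entropy(p_star, p):
    """h(p*, p) = p* ln(p*/p) + (1 - p*) ln((1 - p*)/(1 - p)), as in (1.6.19)."""
    return (p_star * math.log(p_star / p)
            + (1 - p_star) * math.log((1 - p_star) / (1 - p)))

def log_binom_pmf(n, m, p):
    """log P(S_n = m) for S_n ~ Bin(n, p)."""
    return (math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
            + m * math.log(p) + (n - m) * math.log(1 - p))

def log_entropy_form(n, m, p):
    """Logarithm of the denominator in (1.6.17)."""
    p_star = m / n
    return (-0.5 * math.log(2 * math.pi * n * p_star * (1 - p_star))
            - n * rel_entropy(p_star, p))

n, p = 10_000, 0.3
for m in (3000, 4000, 7000):
    print(m, log_binom_pmf(n, m, p) - log_entropy_form(n, m, p))

# Quadratic Taylor term of (1.6.20) against the exact relative entropy.
quadratic = 0.5 * (1 / p + 1 / (1 - p)) * (0.31 - p) ** 2
print(rel_entropy(0.31, p), quadratic)
```

The printed differences of logarithms are of the order of the Stirling corrections $1/(12m)$, in line with the refined formula (1.6.22).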
To derive the integral DMLT, we take the sum

$$\sum_{m}\mathbb{P}(S_n = m)\,I(a < z_n(m) < b)$$

and interpret it as the area between the $x$-axis and the piece-wise horizontal
line representing the graph of the function

$$\Pi_n: x \in \mathbb{R} \mapsto \sqrt{np(1-p)}\,\mathbb{P}(S_n = m), \quad \text{for } z_n(m) \le x < z_n(m+1),$$

on the interval $z_n(\underline{m}) \le x < z_n(\overline{m})$ where $\underline{m}$ and $\overline{m}$ are uniquely determined
from the condition that

$$a < z_n(\underline{m}) \le a + \frac{1}{\sqrt{np(1-p)}}, \qquad b \le z_n(\overline{m}) < b + \frac{1}{\sqrt{np(1-p)}}.$$

Of course, this area equals the integral

$$\int_{z_n(\underline{m})}^{z_n(\overline{m})}\Pi_n(x)\,dx.$$

There are two things here that make the calculation tricky: first, the
integrand $\Pi_n$ and, second, the limits of integration $z_n(\underline{m})$ and $z_n(\overline{m})$ vary
with $n$. To start with, one considers first the case where $a$ and $b$ are both
finite reals. Then, because $z_n(\underline{m}) \to a$ and $z_n(\overline{m}) \to b$, the above integral
differs from

$$\int_a^b \Pi_n(x)\,dx$$

negligibly. Next, in the interval $(a,b)$ we can use the local DMLT which
asserts that

$$\Pi_n(x)\Big/\left[\frac{1}{\sqrt{2\pi}}e^{-z_n(m)^2/2}\right] \to 1, \quad z_n(m) \le x < z_n(m+1).$$

With a bit of analytical work one deduces from this that

$$\int_a^b \Pi_n(x)\,dx \to \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}\,dx.$$

To finish the proof, we must cover the case where $a$ and/or $b$ are infinite.
This is done by exploiting the fact that the convergence in the local DMLT
is uniform when $z_n(m_n)$ is confined to a finite interval, and the limiting
function $x \mapsto e^{-x^2/2}/\sqrt{2\pi}$ is monotone decreasing with $|x|$ and integrable.
The details are omitted.


I have had my results for a long time; but I do not yet know how I am to
arrive at them.
K.-F. Gauss (1777-1855), German mathematician and physicist.


One important aspect of the CLT is that it provides a (fairly accurate) 
normal approximation to other distributions. 


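Worked Example 1.6.2 A cubic die is thrown $n$ times, and $Y_n$ is the
total number of spots shown. Show that $\mathbb{E}Y_n = 7n/2$ and $\mathrm{Var}\,Y_n = 35n/12$.
State Chebyshev’s inequality and find an $n$ such that

$$\mathbb{P}\left(\left|\frac{Y_n}{n} - 3.5\right| > 0.1\right) \le 0.1.$$

Solution Let $Y_n = \sum_{i=1}^{n} X_i$, where $X_i$ is the number of spots on the $i$th
throw. RVs $X_1, \ldots, X_n$ are IID, with $\mathbb{P}(X_i = r) = \frac{1}{6}$, $r = 1, 2, \ldots, 6$. Hence,

$$\mathbb{E}X_i = \frac{7}{2}, \quad \mathrm{Var}(X_i) = \frac{35}{12}, \quad \mathbb{E}Y_n = \frac{7n}{2}, \quad \mathrm{Var}(Y_n) = \frac{35n}{12}.$$

Chebyshev’s inequality is $\mathbb{P}(|X - \mathbb{E}X| \ge \varepsilon) \le (1/\varepsilon^2)\mathrm{Var}\,X$ and is valid for
all RVs $X$ with a finite mean and variance.

By Chebyshev’s inequality, and independence,

$$\mathbb{P}\left(\left|\frac{Y_n}{n} - 3.5\right| > 0.1\right) = \mathbb{P}(|Y_n - \mathbb{E}Y_n| > 0.1n) \le \frac{\mathrm{Var}\,Y_n}{(0.1n)^2} = \frac{35n/12}{0.01n^2} = \frac{1750}{6n}.$$

We want $1750/(6n) \le 0.1$, i.e. $n \ge 2917$; in particular, $n = 2920$ suffices.

Observe that using the normal approximation (the De Moivre–Laplace
Theorem) yields a better lower bound: $n \ge (1.65 \times 10)^2\,35/12 \approx 794$
(here 1.65 is the value of $z$ with $\mathbb{P}(|Z| > z) \approx 0.1$ for $Z \sim$ N(0,1)).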
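The two sample sizes in this example can be recomputed directly. The sketch below is a hedged illustration: `chebyshev_throws` and `normal_throws` are names introduced here, the Chebyshev threshold is evaluated in exact rational arithmetic, and the normal quantile is taken from `statistics.NormalDist`, so the normal answer differs slightly from a hand calculation with a rounded table value.

```python
import math
from fractions import Fraction
from statistics import NormalDist

faces = range(1, 7)
die_mean = Fraction(sum(faces), 6)
die_var = Fraction(sum(r * r for r in faces), 6) - die_mean ** 2

def chebyshev_throws(tol, level):
    """Smallest n with Var(X_1) / (n tol^2) <= level (tol, level as Fractions)."""
    return math.ceil(die_var / (tol ** 2 * level))

def normal_throws(tol, level):
    """Smallest n with P(|Z| > tol sqrt(n) / sd) <= level under the normal approximation."""
    z = NormalDist().inv_cdf(1 - float(level) / 2)
    return math.ceil((z / float(tol)) ** 2 * float(die_var))

tol, level = Fraction(1, 10), Fraction(1, 10)
n_chebyshev = chebyshev_throws(tol, level)
n_normal = normal_throws(tol, level)
print(die_mean, die_var, n_chebyshev, n_normal)
```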
Worked Example 1.6.3 A coin shows heads with probability $p > 0$ and
tails with probability $q = 1 - p$. Let $X_n$ be the number of tosses needed to
obtain $n$ heads. Find the PGF for $X_1$ and compute its mean and variance.
What is the mean and variance of $X_n$?

Now suppose that $p = 1/2$. What bound does Chebyshev’s inequality give
for $\mathbb{P}(X_{100} > 400)$?

Solution We have $\mathbb{P}(X_1 = k) = pq^{k-1}$, and so the PGF is $\phi_{X_1}(s) = ps/(1 - qs)$.
Then $\phi'_{X_1}(s) = p/(1 - qs)^2$, $\phi''_{X_1}(s) = 2pq/(1 - qs)^3$, and so

$$\mathbb{E}X_1 = \phi'_{X_1}(1) = \frac{1}{p}, \qquad \mathrm{Var}(X_1) = \phi''_{X_1}(1) + \phi'_{X_1}(1) - \left(\phi'_{X_1}(1)\right)^2 = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{q}{p^2}.$$

We conclude that $\mathbb{E}X_n = n/p$, $\mathrm{Var}\,X_n = nq/p^2$. Now set $p = q = 1/2$,
$\mathbb{E}X_{100} = 200$, $\mathrm{Var}(X_{100}) = 200$, and write

$$\mathbb{P}(X_{100} > 400) \le \mathbb{P}(|X_{100} - 200| \ge 200) \le \frac{200}{200^2} = \frac{1}{200}.$$

Remark 1.6.4 $X_n \sim$ NegBin$(n, q)$.
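It is instructive to compare the Chebyshev bound of this example with the exact tail probability. The event $\{X_{100} > 400\}$ means that fewer than 100 heads appear in the first 400 tosses, so the probability is a binomial sum. The Python sketch below uses exact fractions; the helper name `prob_exceeds` is an assumption made for the illustration.

```python
import math
from fractions import Fraction

def prob_exceeds(tosses, heads_needed, p):
    """P(X_n > tosses) = P(fewer than n heads among the first `tosses` tosses)."""
    q = 1 - p
    return sum(math.comb(tosses, k) * p ** k * q ** (tosses - k)
               for k in range(heads_needed))

p = Fraction(1, 2)
mean_100 = 100 / p                      # n / p
var_100 = 100 * (1 - p) / p ** 2        # n q / p^2
chebyshev_bound = var_100 / (400 - mean_100) ** 2
exact_tail = prob_exceeds(400, 100, p)
print(mean_100, var_100, chebyshev_bound, float(exact_tail))
```

The exact probability is many orders of magnitude smaller than $1/200$: Chebyshev’s inequality uses only the mean and the variance, and therefore cannot capture the rapid decay of the tail.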


Worked Example 1.6.5 In this problem, and in a number of problems 
below, we use the term ‘sample’ as a substitute for ‘outcome’. This termi- 
nology is particularly useful in statistics; see Chapters 3 and 4. 

(i) How large a random sample should be taken from a distribution in order 
for the probability to be at least 0.99 that the sample mean will be within 
two standard deviations of the mean of the distribution? Use Chebyshev’s 
inequality to determine a sample size that will be sufficient, whatever the 
distribution. 

(ii) How large a random sample should be taken from a normal distribution 
in order for the probability to be at least 0.99 that the sample mean will be 
within one standard deviation of the mean of the distribution? 

Hint: $\Phi(2.58) = 0.995$.


Solution (i) The sample mean $\overline{X}$ has mean $\mu$ and variance $\sigma^2/n$. Hence, by
Chebyshev’s inequality

$$\mathbb{P}(|\overline{X} - \mu| \ge 2\sigma) \le \frac{\sigma^2}{n(2\sigma)^2} = \frac{1}{4n}.$$

Thus $n = 25$ is sufficient. If more is known about the distribution of $X_i$,
then a smaller sample size may suffice: the case of normally distributed $X_i$
is considered in (ii).

(ii) If $X_i \sim$ N$(\mu, \sigma^2)$, then $\overline{X} - \mu \sim$ N$(0, \sigma^2/n)$, and the probability
$\mathbb{P}(|\overline{X} - \mu| \ge \sigma)$ equals

$$\mathbb{P}\left(|\overline{X} - \mu|\Big/\left(\frac{\sigma^2}{n}\right)^{1/2} \ge \sigma\Big/\left(\frac{\sigma^2}{n}\right)^{1/2}\right) = \mathbb{P}\left(|Z| \ge n^{1/2}\right),$$

where $Z \sim$ N(0,1). But $\mathbb{P}(|Z| \ge 2.58) = 0.01$, and so we require that
$n^{1/2} \ge 2.58$, i.e. $n \ge 7$. As we see, knowledge that the distribution is normal
allows a much smaller sample size, even to meet a tighter condition.
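Both sample sizes follow from one-line computations, sketched below in Python. The function names `chebyshev_size` and `normal_size` are illustrative; `k` denotes the allowed deviation in units of the standard deviation and `alpha` the permitted failure probability.

```python
import math
from fractions import Fraction
from statistics import NormalDist

def chebyshev_size(k, alpha):
    """Smallest n with 1 / (n k^2) <= alpha; valid for any distribution with finite variance."""
    return math.ceil(1 / (alpha * k * k))

def normal_size(k, alpha):
    """Smallest n with P(|Z| >= k sqrt(n)) <= alpha for normally distributed observations."""
    z = NormalDist().inv_cdf(1 - float(alpha) / 2)
    return math.ceil((z / k) ** 2)

alpha = Fraction(1, 100)
n_any = chebyshev_size(2, alpha)    # part (i): within two standard deviations
n_norm = normal_size(1, alpha)      # part (ii): within one standard deviation
print(n_any, n_norm)
```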


Worked Example 1.6.6 What does it mean to say that a function $g: (0, \infty) \to \mathbb{R}$
is convex? If $f: (0, \infty) \to \mathbb{R}$ is such that $-f$ is convex, show that

$$\int_n^{n+1} f(x)\,dx \ge \frac{f(n) + f(n+1)}{2},$$

and deduce that

$$\int_1^N f(x)\,dx \ge \frac{1}{2}f(1) + f(2) + \cdots + f(N-1) + \frac{1}{2}f(N),$$

for all integers $N \ge 2$. By choosing an appropriate $f$, show that

$$N^{N+1/2}e^{-(N-1)} \ge N!$$

for all integers $N \ge 2$.
Solution $g: (0, \infty) \to \mathbb{R}$ is convex if for all $x, y \in (0, \infty)$ and $p \in (0,1)$:
$g(px + (1-p)y) \le pg(x) + (1-p)g(y)$. When $g$ is twice differentiable, the
convexity follows from the bound $g''(x) \ge 0$, $x > 0$.

If $-f$ is convex, then

$$-f(pn + (1-p)(n+1)) \le -pf(n) - (1-p)f(n+1).$$

That is, for $x = pn + (1-p)(n+1) = n + 1 - p$,

$$f(x) \ge pf(n) + (1-p)f(n+1),$$

implying that

$$\int_n^{n+1} f(x)\,dx \ge \int_n^{n+1}\left[pf(n) + (1-p)f(n+1)\right]dx.$$

As $dx = -dp = d(1-p)$,

$$\int_n^{n+1} f(x)\,dx \ge \int_0^1\left[(1-p)f(n) + pf(n+1)\right]dp = \frac{1}{2}f(n) + \frac{1}{2}f(n+1).$$

Finally, summing these inequalities from $n = 1$ to $n = N - 1$ gives the first
inequality.

Now choose $f(x) = \ln x$: here, $f'(x) = 1/x$, $f''(x) = -1/x^2$. Hence, $-f$ is
convex. By the above:

$$\int_1^N \ln x\,dx \ge \ln 2 + \cdots + \ln(N-1) + \ln(N^{1/2}),$$

i.e.

$$\frac{1}{2}\ln N + \int_1^N \ln x\,dx = \frac{1}{2}\ln N + (x\ln x - x)\Big|_1^N = \left(N + \frac{1}{2}\right)\ln N - (N-1) \ge \ln(N!),$$

which is the same as $N! \le N^{N+1/2}e^{-(N-1)}$.


Worked Example 1.6.7 A random sample is taken in order to find the
proportion of Labour voters in a population. Find a sample size, $n$, such that the
probability of a sample error less than 0.04 will be 0.99 or greater.


Solution By the normal approximation, we need $0.04\sqrt{n} \ge 2.58\sqrt{pq}$ where $pq \le 1/4$. So, $n \approx 1040$.


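Worked Example 1.6.8 State and prove Chebyshev’s inequality.
Show that if $X_1, X_2, \ldots$ are IID RVs with finite mean $\mu$ and variance $\sigma^2$,
then

$$\mathbb{P}\left(\left|n^{-1}\sum_{i=1}^{n}X_i - \mu\right| \ge \varepsilon\right) \to 0$$

as $n \to \infty$ for all $\varepsilon > 0$.
Suppose that $Y_1, Y_2, \ldots$ are IID random variables such that $\mathbb{P}(Y_i = 4^r) = 2^{-r}$
for all integers $r \ge 1$. Show that

$$\mathbb{P}(\text{at least one of } Y_1, Y_2, \ldots, Y_{2^n} \text{ takes value } 4^n) \to 1 - e^{-1}$$

as $n \to \infty$, and deduce that, whatever the value of $K$,

$$\mathbb{P}\left(2^{-n}\sum_{i=1}^{2^n}Y_i > K\right) \not\to 0.$$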


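Solution Chebyshev’s inequality:

$$\mathbb{P}(|X - \mathbb{E}X| \ge b) \le \frac{1}{b^2}\,\mathrm{Var}\,X, \quad \text{for all } b > 0,$$

follows from the argument below.

$$\mathrm{Var}\,X = \mathbb{E}(X - \mathbb{E}X)^2 \ge \mathbb{E}\left((X - \mathbb{E}X)^2 I\left((X - \mathbb{E}X)^2 \ge b^2\right)\right)$$
$$\ge b^2\,\mathbb{E}I\left((X - \mathbb{E}X)^2 \ge b^2\right)$$
$$= b^2\,\mathbb{P}\left((X - \mathbb{E}X)^2 \ge b^2\right) = b^2\,\mathbb{P}(|X - \mathbb{E}X| \ge b).$$

We apply it to $n^{-1}\sum_{i=1}^{n}X_i$, obtaining

$$\mathbb{P}\left(\left|n^{-1}\sum_{i=1}^{n}X_i - \mu\right| \ge \varepsilon\right) \le \frac{1}{\varepsilon^2}\,\mathrm{Var}\left(n^{-1}\sum_{i=1}^{n}X_i\right) = \frac{1}{n\varepsilon^2}\,\mathrm{Var}(X_1) \to 0$$

as $n \to \infty$. For RVs $Y_1, Y_2, \ldots$ specified in the question:

$$q_n := \mathbb{P}(\text{at least one of } Y_1, Y_2, \ldots, Y_{2^n} \text{ takes value } 4^n)$$
$$= 1 - \prod_{i=1}^{2^n}\mathbb{P}(Y_i \ne 4^n) = 1 - \left[\mathbb{P}(Y_1 \ne 4^n)\right]^{2^n}$$
$$= 1 - \left(1 - \mathbb{P}(Y_1 = 4^n)\right)^{2^n} = 1 - \left(1 - 2^{-n}\right)^{2^n} \to 1 - e^{-1}.$$

Thus, if $2^n > K$,

$$\mathbb{P}\left(2^{-n}\sum_{i=1}^{2^n}Y_i > K\right) \ge q_n \to 1 - e^{-1} > 0.$$

We see that if the $Y_i$ have no finite mean, the averaged sum does not exhibit
convergence to a finite value.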
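The convergence $q_n \to 1 - e^{-1}$ can be observed numerically. The Python sketch below is illustrative; the function name `hit_probability` is an assumption, and `log1p`/`expm1` are used so that the expression $(1 - 2^{-n})^{2^n}$ is evaluated without loss of precision for large $n$.

```python
import math

def hit_probability(n):
    """q_n = 1 - (1 - 2^-n)^(2^n), computed stably via log1p and expm1."""
    return -math.expm1(2 ** n * math.log1p(-(2.0 ** -n)))

limit = 1 - math.exp(-1)
for n in (1, 2, 5, 10, 30):
    print(n, hit_probability(n), hit_probability(n) - limit)
```

The sequence decreases towards the limit, so for every $n$ the probability that a single enormous value $4^n$ dominates the average stays above $1 - e^{-1} \approx 0.632$.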


Worked Example 1.6.9 (i) Suppose that $X$ and $Y$ are discrete random
variables taking finitely many values. Show that $\mathbb{E}(X + Y) = \mathbb{E}X + \mathbb{E}Y$.

(ii) On a dry road I cycle at 20 mph; when the road is wet at 10 mph. The
distance from home to the lecture building is three miles, and the 9:00 am
lecture course is 24 lectures. The probability that on a given morning the
road is dry is 1/2, but there is no reason to believe that dry and wet mornings
follow independently. Find the expected time to cycle to a single lecture and
the expected time for the whole course.

A student friend (not a mathematician) proposes a straightforward
answer:

$$\text{average cycling time for the whole course} = \frac{3 \times 24}{\frac{1}{2}10 + \frac{1}{2}20} = 4\text{ h }48\text{ min}.$$

Explain why his answer gives a shorter time.


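Solution (i) For RVs $X$, $Y$, with finitely many values $x$ and $y$,

$$\mathbb{E}(X + Y) = \sum_{x,y}(x + y)\,\mathbb{P}(X = x, Y = y) = \sum_x x\sum_y \mathbb{P}(X = x, Y = y) + \sum_y y\sum_x \mathbb{P}(X = x, Y = y).$$

The internal sums $\sum_y \mathbb{P}(X = x, Y = y)$ and $\sum_x \mathbb{P}(X = x, Y = y)$ equal,
respectively, $\mathbb{P}(X = x)$ and $\mathbb{P}(Y = y)$. Thus,

$$\mathbb{E}(X + Y) = \sum_x x\,\mathbb{P}(X = x) + \sum_y y\,\mathbb{P}(Y = y) = \mathbb{E}X + \mathbb{E}Y.$$

(ii) If $T$ is the time to a single lecture, then

$$\mathbb{E}T = 3\,\mathbb{E}\left(\frac{1}{\text{speed}}\right) = 3\left(\frac{1}{2} \times \frac{1}{10} + \frac{1}{2} \times \frac{1}{20}\right)\text{h} = 13.5\text{ min}.$$

The total time $= 24 \times 13.5$ min $= 5$ h $24$ min; the assumption of independence
is not needed, as $\mathbb{E}\sum_i T_i = \sum_i \mathbb{E}T_i$ holds in any case. The ‘straightforward’
(and wrong) answer of a student friend gives a shorter time:

$$\frac{3 \times 24}{(1/2) \times 10 + (1/2) \times 20} < \frac{1}{2} \times \frac{3 \times 24}{10} + \frac{1}{2} \times \frac{3 \times 24}{20}.$$

However, the average speed $\ne (1/2) \times 10 + (1/2) \times 20 = 15$. This is a
particular case of Jensen’s inequality with a strictly convex function

$$g(x) = \frac{3 \times 24}{x}, \quad x \in (0, \infty).$$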
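The arithmetic of part (ii) is reproduced below in exact fractions. This is a minimal sketch; the variable names (`speeds`, `expected_hours_one`, `naive_hours_all` and so on) are introduced here for illustration.

```python
from fractions import Fraction

DISTANCE = 3      # miles from home to the lecture building
LECTURES = 24
speeds = {10: Fraction(1, 2), 20: Fraction(1, 2)}   # speed in mph -> probability

# Correct answer: expectation of the time, i.e. distance * E(1 / speed).
expected_hours_one = sum(p * Fraction(DISTANCE, v) for v, p in speeds.items())
expected_hours_all = LECTURES * expected_hours_one

# The friend's answer: total distance divided by the expected speed.
mean_speed = sum(p * v for v, p in speeds.items())
naive_hours_all = Fraction(DISTANCE * LECTURES) / mean_speed

print(expected_hours_one * 60, expected_hours_all, naive_hours_all)
```

The gap between the two totals, 5 h 24 min against 4 h 48 min, is exactly the Jensen gap for the convex function $g(x) = 72/x$.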


Worked Example 1.6.10 What is a convex function? State and prove
Jensen’s inequality for a convex function of an RV which takes finitely
many values.

Deduce that, if $X$ is a non-negative random variable taking finitely many
values, then

$$\mathbb{E}[X] \le (\mathbb{E}X^2)^{1/2} \le (\mathbb{E}X^3)^{1/3} \le \cdots$$

Solution (The second part only.) Consider $g(x) = x^{(n+1)/n}$, a (strictly) convex
function on $J = [0, \infty)$. By Jensen’s inequality:

$$(\mathbb{E}X)^{(n+1)/n} \le \mathbb{E}X^{(n+1)/n}.$$

Finally, let $Y^n = X$ to obtain

$$(\mathbb{E}Y^n)^{1/n} \le (\mathbb{E}Y^{n+1})^{1/(n+1)}.$$


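Worked Example 1.6.11
(i) If $X$ is a bounded random variable show that

$$\mathbb{P}(X \ge \lambda) \le e^{-\lambda}\,\mathbb{E}(e^{X}).$$

(ii) By looking at power series expansions, or otherwise, check that

$$\cosh t \le e^{t^2/2}.$$

If $Y$ is a random variable with $\mathbb{P}(Y = a) = \mathbb{P}(Y = -a) = 1/2$ show that

$$\mathbb{E}(e^{\theta Y}) \le e^{a^2\theta^2/2}.$$

If $Y_1, Y_2, \ldots, Y_n$ are independent random variables with $\mathbb{P}(Y_k = a_k) =
\mathbb{P}(Y_k = -a_k) = 1/2$ and $Z = Y_1 + Y_2 + \cdots + Y_n$, show, explaining your
reasoning carefully, that

$$\mathbb{E}(e^{\theta Z}) \le e^{A^2\theta^2/2},$$

where $A^2$ is to be given explicitly in terms of the $a_k$.
By using (i), or otherwise, show that, if $\lambda > 0$,

$$\mathbb{P}(Z \ge \lambda) \le e^{(A^2\theta^2 - 2\lambda\theta)/2}$$

for all $\theta > 0$. Find the $\theta$ that minimises $e^{(A^2\theta^2 - 2\lambda\theta)/2}$ and show that

$$\mathbb{P}(Z \ge \lambda) \le e^{-\lambda^2/(2A^2)}.$$

Explain why

$$\mathbb{P}(|Z| \ge \lambda) \le 2e^{-\lambda^2/(2A^2)}.$$

(iii) If $a_1 = a_2 = \cdots = a_n = 1$ in (ii), show that

$$\mathbb{P}\left(|Y_1 + Y_2 + \cdots + Y_n| \ge (2n\ln\varepsilon^{-1})^{1/2}\right) \le 2\varepsilon$$

whenever $\varepsilon > 0$.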
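Solution (i) Chernoff's inequality:

$$\mathbb{P}(X \ge \lambda) \le e^{-\lambda}\,\mathbb{E}\big[e^{X} I(X \ge \lambda)\big] \le e^{-\lambda}\,\mathbb{E}e^{X}.$$

(ii) We have

$$\cosh t = \sum_{n=0}^{\infty} \frac{t^{2n}}{(2n)!} \quad\text{and}\quad e^{t^2/2} = \sum_{n=0}^{\infty} \frac{(t^2/2)^n}{n!} = \sum_{n=0}^{\infty} \frac{t^{2n}}{2^n n!}.$$

Now, $(2n)(2n-1)\cdots(n+1) > 2^n$ for $n \ge 2$. So,

$$\cosh t \le e^{t^2/2} \quad\text{and}\quad \mathbb{E}(e^{\theta Y}) = \cosh(\theta a) \le e^{a^2\theta^2/2}.$$

Similarly,

$$\mathbb{E}(e^{\theta Z}) = \mathbb{E}\big(e^{\theta Y_1 + \cdots + \theta Y_n}\big) = \prod_{k=1}^{n} \cosh(\theta a_k) \le \prod_{k=1}^{n} e^{a_k^2\theta^2/2} = e^{A^2\theta^2/2}$$

with $A^2 = \sum_{k=1}^{n} a_k^2$. This implies

$$\mathbb{P}(Z \ge \lambda) \le e^{-\lambda\theta}\,\mathbb{E}(e^{\theta Z}) \le e^{(A^2\theta^2 - 2\lambda\theta)/2}$$

for all $\theta > 0$.
The function $f = e^{(A^2\theta^2 - 2\lambda\theta)/2}$ achieves its minimum for $\theta = \lambda/A^2$. Thus,

$$\mathbb{P}(Z \ge \lambda) \le e^{-\lambda^2/(2A^2)}.$$

Now consider $-Z = -(Y_1 + \cdots + Y_n) = Y_1' + \cdots + Y_n'$, where $Y_k' = -Y_k$, $k = 1, \ldots, n$, has the same distribution as $Y_k$, $k = 1, \ldots, n$; then

$$\mathbb{P}(-Z \ge \lambda) \le e^{-\lambda^2/(2A^2)}.$$

Thus

$$\mathbb{P}(|Z| \ge \lambda) \le 2e^{-\lambda^2/(2A^2)}.$$

(iii) If $a_1 = a_2 = \cdots = a_n = 1$, then $A^2 = n$, and

$$\mathbb{P}\big(|Y_1 + Y_2 + \cdots + Y_n| \ge (2n\ln\epsilon^{-1})^{1/2}\big) \le 2\exp\Big(-\frac{2n\ln\epsilon^{-1}}{2n}\Big) = 2\epsilon.$$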
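The bound in part (iii) can be compared with the exact tail probability, since for $a_k = 1$ the sum $Y_1 + \cdots + Y_n$ equals $2H - n$ with $H$ binomial. The following Python sketch is an illustration only; the function names `two_sided_tail` and `bound` are ours and do not come from the text.

```python
from math import comb, exp, log, sqrt


def two_sided_tail(n, lam):
    """Exact P(|Y_1 + ... + Y_n| >= lam) for independent Y_k = +1 or -1, each with probability 1/2."""
    # The sum equals 2*h - n, where h counts the +1's and is Binomial(n, 1/2).
    favourable = sum(comb(n, h) for h in range(n + 1) if abs(2 * h - n) >= lam)
    return favourable / 2 ** n


def bound(n, lam):
    """The bound 2 exp(-lam^2 / (2 A^2)) with A^2 = n."""
    return 2 * exp(-lam ** 2 / (2 * n))


for n in (10, 50, 200):
    for eps in (0.5, 0.1, 0.01):
        lam = sqrt(2 * n * log(1 / eps))   # the threshold used in part (iii)
        print(n, eps, two_sided_tail(n, lam), bound(n, lam))
```

In each printed row the last entry equals $2\epsilon$ up to rounding, and the exact tail does not exceed it.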


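Worked Example 1.6.12 Let $x_1, x_2, \ldots, x_n$ be positive real numbers.
Then the geometric mean (GM) lies between the harmonic mean (HM) and the
arithmetic mean (AM):

$$\Big(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}\Big)^{-1} \le \Big(\prod_{i=1}^{n} x_i\Big)^{1/n} \le \frac{1}{n}\sum_{i=1}^{n} x_i.$$

The second inequality is the AM–GM inequality; establish the first inequality
(called the HM–GM inequality).


Solution An AM–GM inequality: induction in $n$. For $n = 2$ the inequality is equivalent to $4x_1x_2 \le (x_1 + x_2)^2$.
The inductive passage: the AM–GM inequality is equivalent to

$$\frac{1}{n}\sum_{i=1}^{n}\ln x_i \le \ln\Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big).$$

Function $\ln y$ is (strictly) concave on $(0, \infty)$ (which means that $(\ln y)'' < 0$).
Therefore, for any $\alpha \in [0,1]$ and any $y_1, y_2 > 0$

$$\ln(\alpha y_1 + (1 - \alpha)y_2) \ge \alpha\ln y_1 + (1 - \alpha)\ln y_2.$$

Take $\alpha = 1/n$, $y_1 = x_1$, $y_2 = \sum_{i=2}^{n} x_i/(n-1)$ to obtain

$$\ln\Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big) \ge \frac{1}{n}\ln x_1 + \frac{n-1}{n}\ln\Big(\frac{1}{n-1}\sum_{i=2}^{n} x_i\Big).$$

Finally, according to the induction hypothesis

$$\ln\Big(\frac{1}{n-1}\sum_{i=2}^{n} x_i\Big) \ge \frac{1}{n-1}\sum_{i=2}^{n}\ln x_i.$$

To prove the HM–GM inequality, apply the AM–GM inequality to $1/x_1, \ldots, 1/x_n$:

$$\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i} \ge \Big(\prod_{i=1}^{n}\frac{1}{x_i}\Big)^{1/n}.$$

Hence,

$$\Big(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}\Big)^{-1} \le \Big[\Big(\prod_{i=1}^{n}\frac{1}{x_i}\Big)^{1/n}\Big]^{-1} = \Big(\prod_{i=1}^{n} x_i\Big)^{1/n}.$$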


Worked Example 1.6.13 Let $X$ be a positive random variable taking
only finitely many values. Show that

$$\mathbb{E}\,\frac{1}{X} \ge \frac{1}{\mathbb{E}X},$$

and that the inequality is strict unless $\mathbb{P}(X = \mathbb{E}X) = 1$.


Solution Let $X$ take values $x_1, x_2, \ldots, x_n > 0$ with probabilities $p_i$. Then
this inequality is equivalent to $\mathbb{E}X \ge [\mathbb{E}(1/X)]^{-1}$, i.e.

$$\sum_{i=1}^{n} p_i x_i \ge \Big(\sum_{i=1}^{n} p_i\frac{1}{x_i}\Big)^{-1}. \qquad (1.6.23)$$

We shall deduce inequality (1.6.23) from a bound which is a generalisation
of the above AM–GM inequality:

$$\prod_{i=1}^{n} x_i^{p_i} \le \sum_{i=1}^{n} p_i x_i. \qquad (1.6.24)$$

In fact, applying inequality (1.6.24) to the values $1/x_i$ yields

$$\Big(\sum_{i=1}^{n} p_i\frac{1}{x_i}\Big)^{-1} \le \prod_{i=1}^{n} x_i^{p_i}.$$

Then equation (1.6.23) follows, again by (1.6.24).

To prove bound (1.6.24), assume that all $x_i$ are pair-wise distinct, and
proceed by induction in $n$. For $n = 1$, bound (1.6.24) becomes an equality.
Assume inequality (1.6.24) holds for $n - 1$ and prove that

$$\sum_{i=1}^{n} p_i\ln x_i \le \ln\Big[\sum_{i=1}^{n} p_i x_i\Big].$$

We again use the strict concavity of $\ln$:

$$\ln[\alpha y_1 + (1 - \alpha)y_2] \ge \alpha\ln y_1 + (1 - \alpha)\ln y_2.$$

Take $\alpha = p_1$, $y_1 = x_1$, $y_2 = \sum_{j=2}^{n} p_j x_j/(1 - p_1)$ to obtain

$$\ln\sum_{i=1}^{n} p_i x_i = \ln\Big(p_1 x_1 + (1 - p_1)\sum_{i=2}^{n} p_i' x_i\Big) \ge p_1\ln x_1 + (1 - p_1)\ln\Big(\sum_{i=2}^{n} p_i' x_i\Big),$$

where $p_i' = p_i/(1 - p_1)$, $i = 2, \ldots, n$. We can now use the induction hypothesis

$$\ln\Big(\sum_{i=2}^{n} p_i' x_i\Big) \ge \sum_{i=2}^{n} p_i'\ln x_i$$

to get the required result. The equality holds if and only if either $p_1(1 - p_1) = 0$
or $x_1 = \sum_{i=2}^{n} p_i' x_i$. Scanning the situation for $x_2, \ldots, x_n$, we conclude that
the equality occurs if and only if either $p_i(1 - p_i) = 0$ for some (and hence
for all) $i$ or $x_1 = \cdots = x_n$. According to our agreement, this means that
$n = 1$, i.e.

$$\mathbb{P}(X = \mathbb{E}X) = 1.$$
Worked Example 1.6.14 What is a convex function? State and prove
Jensen's inequality for convex functions. Use it to prove the arithmetic–
geometric mean inequality which states that if $a_1, a_2, \ldots, a_n > 0$, then

$$\frac{a_1 + a_2 + \cdots + a_n}{n} \ge (a_1 a_2 \cdots a_n)^{1/n}.$$

(You may assume that a function with positive second derivative is convex.)


Solution A function $f : (a, b) \to \mathbb{R}$ is convex if for all $x, x' \in (a, b)$ and
$p \in (0, 1)$:

$$f(px + (1 - p)x') \le pf(x) + (1 - p)f(x').$$

Jensen's inequality for RVs with finitely many values $x_1, \ldots, x_n$ is that
for all $f$ as above,

$$f\Big[\sum_{i=1}^{n} p_i x_i\Big] \le \sum_{i=1}^{n} p_i f(x_i)$$

for all $x_1, \ldots, x_n \in (a, b)$ and $p_1, \ldots, p_n \in (0, 1)$ with $p_1 + \cdots + p_n = 1$.

For the proof, use induction in $n$. For $n = 1$, the inequality is trivially
true. (For $n = 2$, it is equivalent to the definition of a convex function.)
Suppose it is true for some $n$. Then, for $n + 1$, let $x_1, \ldots, x_{n+1} \in (a, b)$ and
$p_1, \ldots, p_{n+1} \in (0, 1)$ with $p_1 + \cdots + p_{n+1} = 1$. Setting $p_i' = p_i/(1 - p_{n+1})$, we
have that $p_1', \ldots, p_n' \in (0, 1)$ and $p_1' + \cdots + p_n' = 1$. Then, by the definition
of convexity and induction hypothesis,

$$\begin{aligned}
f\Big(\sum_{i=1}^{n+1} p_i x_i\Big) &= f\Big((1 - p_{n+1})\sum_{i=1}^{n} p_i' x_i + p_{n+1}x_{n+1}\Big) \\
&\le (1 - p_{n+1})\,f\Big(\sum_{i=1}^{n} p_i' x_i\Big) + p_{n+1}f(x_{n+1}) \\
&\le (1 - p_{n+1})\sum_{i=1}^{n} p_i' f(x_i) + p_{n+1}f(x_{n+1}) = \sum_{i=1}^{n+1} p_i f(x_i).
\end{aligned}$$

So, the inequality holds for $n + 1$. Hence, it is true for all $n$.
For $f : (0, \infty) \to \mathbb{R}$ with $f(x) = -\ln x$: $f'(x) = -1/x$ and $f''(x) = 1/x^2 > 0$.
So, $f$ is convex. By Jensen's inequality, with $p_i = 1/n$, $i = 1, \ldots, n$:

$$f\Big(\frac{1}{n}\sum_{i=1}^{n} a_i\Big) \le \frac{1}{n}\sum_{i=1}^{n} f(a_i),$$

i.e. $\ln\big(\frac{1}{n}\sum_{i} a_i\big) \ge \frac{1}{n}\sum_{i}\ln a_i$, and taking exponentials gives the arithmetic–geometric mean inequality.
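Worked Example 1.6.15 A box contains $N$ plastic chips labelled by the
numbers $1, \ldots, N$. An experiment consists of drawing $n$ of these chips from
the box, where $n \le N$. We assume that each chip is equally likely to be
selected and that the drawing is without replacement. Let $X_1, \ldots, X_n$ be
random variables, where $X_i$ is the number on the $i$th chip drawn from the
box, $i = 1, \ldots, n$. Set $Y = X_1 + X_2 + \cdots + X_n$.

(i) Check that $\mathrm{Var}(X_i) = (N + 1)(N - 1)/12$.

Hint: $\sum_{i=1}^{N} i^2 = N(N + 1)(2N + 1)/6$.

(ii) Check that $\mathrm{Cov}(X_i, X_j) = -(N + 1)/12$, $i \ne j$.

(iii) Using the formula

$$\mathrm{Var}(Y) = \sum_{i=1}^{n}\mathrm{Var}(X_i) + \sum_{i \ne j}\mathrm{Cov}(X_i, X_j),$$

or otherwise, prove that

$$\mathrm{Var}(Y) = \frac{n(N + 1)(N - n)}{12}.$$


Solution (i) Clearly, $\mathbb{E}X_i = (N + 1)/2$. Then

$$\mathrm{Var}(X_i) = \sum_{k=1}^{N}\frac{k^2}{N} - \Big(\frac{N + 1}{2}\Big)^2 = \frac{1}{N}\,\frac{N(N + 1)(2N + 1)}{6} - \Big(\frac{N + 1}{2}\Big)^2 = \frac{(N + 1)(N - 1)}{12}.$$

(ii) As $X_i$ and $X_j$ cannot be equal to each other,

$$\begin{aligned}
\mathrm{Cov}(X_i, X_j) &= \frac{1}{N(N - 1)}\sum_{k \ne s}\Big(k - \frac{N + 1}{2}\Big)\Big(s - \frac{N + 1}{2}\Big) \\
&= \frac{1}{N(N - 1)}\sum_{k=1}^{N}\Big(k - \frac{N + 1}{2}\Big)\sum_{s=1}^{N}\Big(s - \frac{N + 1}{2}\Big) - \frac{1}{N(N - 1)}\sum_{k=1}^{N}\Big(k - \frac{N + 1}{2}\Big)^2.
\end{aligned}$$

The first sum equals zero, and the second equals $-(\mathrm{Var}\,X_i)/(N - 1)$. Hence,
$\mathrm{Cov}(X_i, X_j) = -(N + 1)/12$.
(iii)

$$\mathrm{Var}(Y) = \frac{n(N + 1)(N - 1)}{12} - \frac{n(n - 1)(N + 1)}{12} = \frac{n(N + 1)(N - n)}{12}.$$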
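For small $N$ and $n$ the three formulas can be checked by brute force, enumerating every equally likely ordered draw. The Python sketch below is only an illustration under that assumption (it needs $2 \le n \le N$ and is feasible only for small values); the helper name `moments` is ours.

```python
from fractions import Fraction
from itertools import permutations


def moments(N, n):
    """Exact Var(X_1), Cov(X_1, X_2) and Var(Y) for n draws without replacement from 1..N (n >= 2)."""
    draws = list(permutations(range(1, N + 1), n))   # all equally likely ordered draws
    total = Fraction(len(draws))

    def mean(f):
        return sum(Fraction(f(d)) for d in draws) / total

    ex1 = mean(lambda d: d[0])
    ex2 = mean(lambda d: d[1])
    var_x = mean(lambda d: d[0] ** 2) - ex1 ** 2
    cov = mean(lambda d: d[0] * d[1]) - ex1 * ex2
    var_y = mean(lambda d: sum(d) ** 2) - mean(sum) ** 2
    return var_x, cov, var_y


N, n = 6, 3
print(moments(N, n))
print(Fraction((N + 1) * (N - 1), 12), Fraction(-(N + 1), 12), Fraction(n * (N + 1) * (N - n), 12))
```

The two printed triples should coincide, the second being the closed forms from parts (i) to (iii).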


Worked Example 1.6.16 Suppose $N$ horses take part in $m$ races. The
results of different races are independent. The probability for horse $i$ to win
any given race is $p_i > 0$, with $p_1 + \cdots + p_N = 1$.

Let $Q$ be the probability that a single horse wins all $m$ races. Express $Q$
as a polynomial of degree $m$ in the variables $p_1, \ldots, p_N$.

Prove that $Q \ge N^{1-m}$. When does equality $Q = N^{1-m}$ hold?


Solution The polynomial is

$$Q = p_1^m + \cdots + p_N^m.$$

By Jensen's inequality for function $f(x) = x^m$

$$\Big(\frac{1}{N}\Big)^m = f\Big(\frac{1}{N}p_1 + \cdots + \frac{1}{N}p_N\Big) \le \frac{1}{N}f(p_1) + \cdots + \frac{1}{N}f(p_N) = \frac{1}{N}Q.$$

Hence,

$$Q \ge N^{1-m}.$$

Next, put $p_N = 1 - (p_1 + \cdots + p_{N-1})$ and differentiate $Q$ with respect to
$p_1, \ldots, p_{N-1}$ to find the minimiser. We get

$$\frac{\partial Q}{\partial p_i} = mp_i^{m-1} - m[1 - (p_1 + \cdots + p_{N-1})]^{m-1} = 0,$$

i.e. $p_i = p_N$, $1 \le i < N$. We obtain that the minimal value $Q_{\min}$ is $N^{1-m}$.
It is achieved when $p_1 = \cdots = p_{N-1} = p_N = 1/N$.
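Problem 1.6.17 Let $(X_k)$ be a sequence of IID random variables with
mean $\mu$ and variance $\sigma^2$. Show that

$$\sum_{k=1}^{n}(X_k - \overline{X})^2 = \sum_{k=1}^{n}(X_k - \mu)^2 - n(\overline{X} - \mu)^2,$$

where $\overline{X} = \frac{1}{n}\sum_{k=1}^{n} X_k$. Prove that, if $\mathbb{E}(X_1 - \mu)^4 < \infty$, then for every $\epsilon > 0$

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{k=1}^{n}(X_k - \overline{X})^2 - \sigma^2\Big| > \epsilon\Big) \to 0$$

as $n \to \infty$.


Hint: By Chebyshev's inequality

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{k=1}^{n}(X_k - \mu)^2 - \sigma^2\Big| > \epsilon/2\Big) \to 0.$$


Problem 1.6.18 Let $b_1, b_2, \ldots, b_n$ be a rearrangement of the positive real
numbers $a_1, a_2, \ldots, a_n$. Prove that

$$\sum_{i=1}^{n}\frac{a_i}{b_i} \ge n.$$


Hint: $\prod_{i=1}^{n}(a_i/b_i) = 1$.


Problem 1.6.19 Let $X$ be an RV for which $\mathbb{E}X = \mu$ and $\mathbb{E}(X - \mu)^4 = \beta_4$.
Prove that

$$\mathbb{P}(|X - \mu| \ge t) \le \frac{\beta_4}{t^4}.$$


Hint: Use Markov's inequality for $|X - \mu|^4$.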


1.7 Branching processes 


Life is a school of probability. 
W. Bagehot (1826-1877), British economist. 


Branching processes are a fine chapter of probability theory. Historically, the 
concept of a branching process was designed to calculate the survival prob- 
abilities of noble families. The name of W.R. Hamilton (1805-1865), the 
famous Irish mathematician, should be mentioned here, as well as F. Galton 
(1822-1911), the English scientist and explorer, and H.W. Watson (1827-1903),
the English mathematician. Francis Galton made many remarkable
discoveries; in particular, he was the first to observe that fingerprints are
individual and do not change with the age of a person. This method was
immediately appreciated by Scotland Yard. Since the 1940s branching processes
have been used extensively in natural sciences, in particular to calculate 
products of nuclear fission (physics) and the size of populations (biology). 
Later they found powerful applications in computer science (algorithms on 
logical trees) and other disciplines. 


The model giving rise to a branching process is simple and elegant. Ini- 
tially, we have an item (a particle or a biological organism) that produces 
a random number of ‘offspring’ each of which produces a random number 
of offspring and so on. This generates a ‘tree-like’ structure where a descen- 
dant has a link to the parent and a number of links to its own offspring. See 
Figure 1.7. 


Each site of the emerging (random) tree has a path that joins it with the 
ultimate ancestor (called the origin, or the root of the tree). The length of 
the path, which is equal to the number of links in it, measures the number 
of generations behind the given site (and the item it represents). Each site 
gives rise to a subtree that grows from it (for some sites there may be no 
continuation, when the number of offspring is zero). 

The main assumption is that the process carries on with maximum inde- 
pendence and homogeneity: the number of offspring produced from a given 
parent is independent of the numbers related to other sites. More precisely, 
we consider RVs $X_0, X_1, X_2, \ldots$, where $X_n$ gives the size of the population


Figure 1.7 
in the $n$th generation. That is

$X_0 = 1$,
$X_1$ = the number of offspring after the 1st fission,
$X_2$ = the number of offspring after the 2nd fission,
etc.


The RVs $X_n$ and $X_{n+1}$ are related by the following recursion:

$$X_{n+1} = \sum_{i=1}^{X_n} Y_i^{(n)}, \qquad (1.7.1)$$

where $Y_i^{(n)}$ is the number of descendants produced by the $i$th member of the
$n$th generation. The RVs $Y_i^{(n)}$ are supposed to be IID, and their common
distribution determines the branching process.


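The first important exercise is to calculate the mean value $\mathbb{E}X_n$, i.e. the
expected size of the $n$th generation. By using the conditional expectation,

$$\begin{aligned}
\mathbb{E}X_n &= \mathbb{E}[\mathbb{E}(X_n|X_{n-1})] = \sum_m \mathbb{P}(X_{n-1} = m)\,\mathbb{E}(X_n|X_{n-1} = m) \\
&= \sum_m \mathbb{P}(X_{n-1} = m)\,\mathbb{E}\sum_{i=1}^{m} Y_i^{(n-1)} = \sum_m \mathbb{P}(X_{n-1} = m)\,m\,\mathbb{E}Y_1^{(n-1)} \\
&= \mathbb{E}Y_1^{(n-1)}\sum_m \mathbb{P}(X_{n-1} = m)\,m = \mathbb{E}Y_1^{(n-1)}\,\mathbb{E}X_{n-1}. \qquad (1.7.2)
\end{aligned}$$

Value $\mathbb{E}Y_i^{(k)}$ does not depend on $k$ and $i$, and we denote it by $\mathbb{E}Y$ for short.
Then, recursively,

$$\mathbb{E}X_1 = \mathbb{E}Y, \quad \mathbb{E}X_2 = (\mathbb{E}Y)^2, \quad \ldots, \quad \mathbb{E}X_n = (\mathbb{E}Y)^n, \quad \ldots. \qquad (1.7.3)$$

We see that if $\mathbb{E}Y < 1$, $\mathbb{E}X_n \to 0$ with $n$, i.e. the process eventually dies out.
This case is often referred to as subcritical. On the contrary, if $\mathbb{E}Y > 1$ (a
supercritical process), then $\mathbb{E}X_n \to \infty$. The borderline case $\mathbb{E}Y = 1$ is called
critical.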


Remark 1.7.1 We did not use the independence assumption in (1.7.3). 


A convenient characteristic is the common PGF $\phi(s) = \mathbb{E}s^{Y}$ of the RVs
$Y_i^{(n)}$ (again it does not depend on $n$ and $i$). Here, an important fact is that if
$\phi_n(s) = \mathbb{E}s^{X_n}$ is the PGF of the size of the $n$th generation, then $\phi_1(s) = \phi(s)$
and, recursively,

$$\phi_{n+1}(s) = \phi_n(\phi(s)), \quad n \ge 1. \qquad (1.7.4)$$

In other words,

$$\phi_n(s) = \phi(\phi(\cdots\phi(s)\cdots)) = \phi\circ\cdots\circ\phi(s) \quad n \text{ times}, \qquad (1.7.5)$$

where $\phi\circ\cdots\circ\phi$ stands for the iteration of the map $s \mapsto \phi(s)$. In particular,

$$\phi_{n+1}(s) = \phi(\phi_n(s)).$$

This construction leads to an interesting analysis of extinction probabilities

$$\pi_n = \mathbb{P}(X_n = 0) = \phi_n(0) = \phi^{\circ n}(0). \qquad (1.7.6)$$

As $\pi_{n+1} = \phi(\pi_n)$, intuitively we would expect the limit $\pi = \lim_{n\to\infty}\pi_n$ to be a
fixed point of map $s \mapsto \phi(s)$, i.e. a solution to $z = \phi(z)$. One such point is
1 (as $\phi(1) = 1$), but there may be another solution lying between 0 and 1.
An important fact here is that function $\phi$ is convex, and the value of $\phi(0)$ is
between 0 and 1. Then if $\phi'(1) > 1$, there will be a root of equation $z = \phi(z)$
in $(0, 1)$. Otherwise, $z = 1$ will be the smallest positive root. See Figure 1.8.

In fact, it is not difficult to check that the limiting extinction probability
$\pi$ exists and

$$\pi = \text{the least non-negative solution to } z = \phi(z). \qquad (1.7.7)$$

Indeed, if there exists $z \in (0, 1)$ with $z = \phi(z)$, then it is unique, and also
$0 < \phi(z) < 1$ and $0 < \phi(0) = \mathbb{P}(Y = 0) < z$ (as $\phi$ is convex and $\phi(1) = 1$).
Then $\phi(0) < \phi(\phi(0)) < \cdots < z$ because $\phi$ is monotone (and $\phi(z) = z$).
The sequence $\phi^{\circ n}(0)$ must then have a limit which, by continuity, coincides
with $z$.

If the least non-negative fixed point is $z = 1$, then the above analysis
can be repeated without changes, yielding that $\pi = 1$. We conclude that if
$\mathbb{P}(Y = 0) > 0$, then $\pi > 0$ (actually, $\pi > \mathbb{P}(Y = 0)$). On the other hand,
if $\phi(0) = 0$ (i.e. $\mathbb{P}(Y = 0) = 0$), then, trivially, $\pi = 0$. This establishes


Figure 1.8 
equation (1.7.7). We see that even in the supercritical case (with $\phi'(1) > 1$)
the limiting extinction probability $\pi$ can be arbitrarily close to 1.
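The monotone convergence $\phi^{\circ n}(0) \uparrow \pi$ also gives a simple numerical recipe for the extinction probability. The Python sketch below assumes a finite offspring distribution supplied as a list of probabilities; the helper names `pgf` and `extinction_probability` and the two sample distributions are our own illustrations.

```python
def pgf(probs):
    """Return phi(s) = sum_k probs[k] * s**k for a finite offspring distribution."""
    def phi(s):
        return sum(p * s ** k for k, p in enumerate(probs))
    return phi


def extinction_probability(phi, tol=1e-12, max_iter=100_000):
    """Iterate pi_{n+1} = phi(pi_n) from pi_0 = 0 until successive values differ by less than tol."""
    pi = 0.0
    for _ in range(max_iter):
        nxt = phi(pi)
        if abs(nxt - pi) < tol:
            return nxt
        pi = nxt
    return pi   # not converged within max_iter (slow in the critical case)


# Supercritical example: mean 1/4 + 2 * 1/2 = 5/4; z = phi(z) has roots 1/2 and 1.
print(extinction_probability(pgf([0.25, 0.25, 0.5])))
# Subcritical example: mean 1/4 + 2 * 1/4 = 3/4; the least root is 1.
print(extinction_probability(pgf([0.5, 0.25, 0.25])))
```

Starting the iteration at 0 matters: started at 1, it would stay at the fixed point 1 even in the supercritical case.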

A slight modification of the above construction arises when we initially 
have several items (possibly a random number $X_0$).


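Worked Example 1.7.2 In a branching process every individual has
probability $p_k$ of producing exactly $k$ offspring, $k = 0, 1, \ldots$, and the
individuals of each generation produce offspring independently of each other
and of individuals in preceding generations. Let $X_n$ represent the size of the
$n$th generation. Assume that $X_0 = 1$ and $p_0 > 0$ and let $\phi_n(s)$ be the PGF
of $X_n$. Thus

$$\phi_1(s) = \mathbb{E}s^{X_1} = \sum_{k \ge 0} p_k s^k.$$

(i) Prove that

$$\phi_{n+1}(s) = \phi_n(\phi_1(s)).$$

(ii) Prove that for $n < m$

$$\mathbb{E}[s^{X_n}|X_m = 0] = \frac{\phi_n(s\phi_{m-n}(0))}{\phi_m(0)}.$$


Solution (i) By definition,

$$\phi_{n+1}(s) = \mathbb{E}s^{X_{n+1}} = \sum_k \mathbb{P}(X_{n+1} = k)s^k,$$

where $X_n$ denotes the size of the $n$th generation. We write

$$\mathbb{E}s^{X_{n+1}} = \sum_l p_l^{(n)}\,\mathbb{E}(s^{X_{n+1}}|X_n = l).$$

Here $p_l^{(n)} = \mathbb{P}(X_n = l)$, and the conditional expectation is

$$\mathbb{E}(s^{X_{n+1}}|X_n = l) = \sum_k \mathbb{P}(X_{n+1} = k|X_n = l)s^k.$$

Now observe that

$$\mathbb{E}(s^{X_{n+1}}|X_n = l) = (\mathbb{E}s^{X_1})^l = (\phi_1(s))^l$$

because (a) under the condition $X_n = l$, $X_{n+1} = \widetilde{X}_1 + \cdots + \widetilde{X}_l$,
where $\widetilde{X}_j$ is the number of offspring produced by the $j$th individual of the
$n$th generation, (b) all the $\widetilde{X}_j$ are IID and $\mathbb{E}(s^{\widetilde{X}_j}|X_n = l) = \mathbb{E}s^{X_1} = \phi_1(s)$.
This relation yields

$$\mathbb{E}s^{X_{n+1}} = \sum_l p_l^{(n)}(\phi_1(s))^l = \phi_n(\phi_1(s)).$$

(ii) Denote by $I^{(m)}$ the indicator $I(X_m = 0)$. Then $\mathbb{P}(I^{(m)} = 1) = \mathbb{E}I^{(m)} =
\mathbb{P}(X_m = 0) = \phi_m(0)$. Furthermore,

$$\mathbb{E}[s^{X_n}|X_m = 0] = \mathbb{E}(s^{X_n}I^{(m)})/\phi_m(0).$$

Hence, it suffices to check that

$$\mathbb{E}(s^{X_n}I^{(m)}) = \phi_n(s\phi_{m-n}(0)).$$

Indeed,

$$\mathbb{E}(s^{X_n}I^{(m)}) = \sum_k s^k\,\mathbb{P}(X_n = k, X_m = 0) = \sum_k s^k\,\mathbb{P}(X_n = k)\,\mathbb{P}(X_m = 0|X_n = k).$$

Now, since $\mathbb{P}(X_m = 0|X_n = k) = \phi_{m-n}(0)^k$,

$$\mathbb{E}(s^{X_n}I^{(m)}) = \phi_n(s\phi_{m-n}(0)).$$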


Worked Example 1.7.3 A laboratory keeps a population of aphids. The
probability of an aphid passing a day uneventfully is $q < 1$. Given that a
day is not uneventful, there is a probability $r$ that the aphid will have one
offspring, a probability $s$ that it will have two offspring, and a probability
$t$ that it will die of exhaustion, where $r + s + t = 1$. Offspring are ready to
reproduce the next day. The fates of different aphids are independent, as are
the events of different days. The laboratory starts out with one aphid.

Let $X_n$ be the number of aphids at the end of $n$ days. Show how to obtain
an expression for the PGF $\phi_n(z)$ of $X_n$. What is the expected value of $X_n$?

Show that the probability of extinction does not depend on $q$ and that
if $2r + 3s \le 1$, the aphids will certainly die out. Find the probability of
extinction if $r = 1/5$, $s = 2/5$ and $t = 2/5$.


Solution Denote by $\phi_{X_1}(z)$ the PGF of $X_1$, i.e. the number of aphids
generated, at the end of a single day, by a single aphid (including the initial
aphid). Then

$$\phi_{X_1}(z) = (1 - q)t + qz + (1 - q)rz^2 + (1 - q)sz^3, \quad z > 0.$$

Write $\mathbb{E}X_n = (\mathbb{E}X_1)^n$ with $\mathbb{E}X_1 = \phi_{X_1}'(1) = q + 2(1 - q)r + 3(1 - q)s$. Indeed,
$\phi_{X_n}(z) = \phi_{X_{n-1}}(\phi_{X_1}(z))$ implies $\mathbb{E}X_n = \mathbb{E}X_{n-1}\mathbb{E}X_1$ or $\mathbb{E}X_n = (\mathbb{E}X_1)^n$. The
probability of extinction at the end of $n$ days is $\phi_{X_n}(0)$. It is non-decreasing
with $n$ and tends to a limit as $n \to \infty$ giving the probability of (eventual)
extinction $\pi$. As we already know, $\pi < 1$ if and only if $\mathbb{E}X_1 > 1$. The
extinction probability $\pi$ is the minimal positive root of

$$(1 - q)t + qz + (1 - q)rz^2 + (1 - q)sz^3 = z$$

or, after division by $(1 - q)$ (since $q < 1$):

$$t - z + rz^2 + sz^3 = 0.$$

The last equation does not depend on $q$, hence $\pi$ also does not depend on
$q$. Condition $\mathbb{E}X_1 \le 1$ is equivalent to $2r + 3s \le 1$. In the case $r = 1/5$, $s =
2/5$, $t = 2/5$, the equation takes the form $2z^3 + z^2 - 5z + 2 = 0$. Dividing by
$(z - 1)$ (as $z = 1$ is a root) one gets a quadratic equation $2z^2 + 3z - 2 = 0$, with
roots $z_\pm = (-3 \pm 5)/4$. The positive root is 1/2, and it gives the extinction
probability.
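The claim that $\pi$ does not depend on $q$ can be seen numerically by iterating $\phi_{X_1}$ from 0 for several values of $q$. This is a rough Python sketch; the names `aphid_pgf` and `extinction`, the sample values of $q$ and the fixed iteration count are all our own assumptions, and the count would have to grow as $q$ approaches 1.

```python
def aphid_pgf(q, r, s, t):
    """PGF of the number of aphids one aphid leaves after a single day."""
    def phi(z):
        return (1 - q) * t + q * z + (1 - q) * r * z ** 2 + (1 - q) * s * z ** 3
    return phi


def extinction(phi, n_iter=20_000):
    """Approximate the limit of phi_n(0) by applying phi repeatedly, starting from 0."""
    pi = 0.0
    for _ in range(n_iter):
        pi = phi(pi)
    return pi


for q in (0.0, 0.3, 0.9):
    # A larger q slows the iteration down but should not change the limit.
    print(q, extinction(aphid_pgf(q, r=0.2, s=0.4, t=0.4)))
```

For each $q$ the printed value should agree with the root 1/2 found above, up to rounding.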


Worked Example 1.7.4 Let $\phi(s) = 1 - p(1 - s)^\beta$, where $0 < p < 1$ and
$0 < \beta < 1$. Prove that $\phi(s)$ is a PGF and that its iterates are

$$\phi_n(s) = 1 - p^{1+\beta+\cdots+\beta^{n-1}}(1 - s)^{\beta^n}, \quad n = 1, 2, \ldots.$$

Find the mean $m$ of the associated distribution and the extinction probability
$\pi = \lim_{n\to\infty}\phi_n(0)$ for a branching process with offspring distribution
determined by $\phi$.


Solution The coefficients in Taylor's expansion of $\phi(s)$ are

$$a_k = \frac{d^k}{ds^k}\phi(s)\Big|_{s=0} = p\beta(\beta - 1)\cdots(\beta - k + 1)(-1)^{k-1} \ge 0, \quad k = 1, 2, \ldots, \quad a_0 = 1 - p,$$

and $\phi(1) = \sum_{k \ge 0} a_k/k! = 1$. Thus, $\phi(s)$ is the PGF for the probabilities
$p_k = a_k/k!$.
The second iterate, $\phi_2(s) = \phi(\phi(s))$, is of the form

$$\phi(\phi(s)) = 1 - p[1 - \phi(s)]^\beta = 1 - pp^\beta(1 - s)^{\beta^2}.$$

Assume inductively that $\phi_k(s) = 1 - p^{1+\beta+\cdots+\beta^{k-1}}(1 - s)^{\beta^k}$, $k \le n - 1$.
Then

$$\phi_n(s) = 1 - p[1 - \phi_{n-1}(s)]^\beta = 1 - p\big[p^{1+\beta+\cdots+\beta^{n-2}}(1 - s)^{\beta^{n-1}}\big]^\beta = 1 - p^{1+\beta+\cdots+\beta^{n-1}}(1 - s)^{\beta^n},$$

as required.
Finally, the mean value is $\phi'(1) = \lim_{s\to 1-}\phi'(s) = +\infty$ and the extinction
probability,

$$\pi = \lim_{n\to\infty}\phi_n(0) = 1 - p^{1/(1-\beta)}.$$
Noo 


Worked Example 1.7.5 At time 0, a blood culture starts with one red
cell. At the end of 1 min, the red cell dies and is replaced by one of the
following combinations with probabilities as indicated

two red cells: $\frac{1}{4}$; one red, one white: $\frac{2}{3}$; two white cells: $\frac{1}{12}$.

Each red cell lives for 1 min and gives birth to offspring in the same way as
the parent cell. Each white cell lives for 1 min and dies without reproducing.
Assume that individual cells behave independently.

(i) At time $n + \frac{1}{2}$ min after the culture began, what is the probability that
no white cells have yet appeared?

(ii) What is the probability that the entire culture dies out eventually?


Solution (i) The event

$$\{\text{by time } n + \tfrac{1}{2} \text{ no white cells have yet appeared}\}$$

implies that the total number of cells equals the total number of red cells and
equals $2^n$. Then, denoting the probability of this event by $p_n$, we find that

$$p_0 = 1, \quad p_1 = \frac{1}{4} \quad\text{and}\quad p_{n+1} = p_n\Big(\frac{1}{4}\Big)^{2^n}, \quad n \ge 1,$$

whence

$$p_n = \Big(\frac{1}{4}\Big)^{2^{n-1}}\Big(\frac{1}{4}\Big)^{2^{n-2}}\cdots\Big(\frac{1}{4}\Big)^{2^0} = \Big(\frac{1}{4}\Big)^{2^n - 1}, \quad n \ge 1.$$

(ii) The extinction probability $\pi$ obeys

$$\pi = \frac{1}{4}\pi^2 + \frac{2}{3}\pi + \frac{1}{12},$$

whence $3\pi^2 - 4\pi + 1 = 0$, with roots $\pi = 1/3$ and $\pi = 1$.
We must take $\pi = 1/3$, the smaller root.


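Worked Example 1.7.6 Let $\{X_n\}$ be a branching process such that
$X_0 = 1$, $\mathbb{E}X_1 = \mu$. If $Y_n = X_0 + \cdots + X_n$, and for $0 \le s \le 1$

$$\psi_n(s) = \mathbb{E}s^{Y_n},$$

prove that

$$\psi_{n+1}(s) = s\phi(\psi_n(s)),$$

where $\phi(s) = \mathbb{E}s^{X_1}$. Deduce that, if $Y = \sum_{n \ge 0} X_n$, then $\psi(s) = \mathbb{E}s^{Y}$ satisfies

$$\psi(s) = s\phi(\psi(s)), \quad 0 \le s \le 1.$$

If $\mu < 1$, prove that $\mathbb{E}Y = (1 - \mu)^{-1}$.


Solution Write $Y_{n+1} = 1 + Y_n^1 + \cdots + Y_n^{X_1}$, where $Y_n^j$ is the total number of
offspring produced by individual $j$ from the first generation. Then the RVs
$Y_n^j$ are IID and have the PGF $\psi_n(s)$. Hence

$$\psi_{n+1}(s) = \mathbb{E}s^{Y_{n+1}} = s\sum_j \mathbb{P}(X_1 = j)\prod_{l=1}^{j}\mathbb{E}s^{Y_n^l} = s\sum_j \mathbb{P}(X_1 = j)(\psi_n(s))^j = s\phi(\psi_n(s)).$$

The infinite series $Y = \sum_{n \ge 0} X_n$ has the PGF $\psi(s) = \lim_{n\to\infty}\psi_n(s)$. Hence,
it obeys

$$\psi(s) = s\phi(\psi(s)).$$

By induction, $\mathbb{E}Y_n = 1 + \mu + \cdots + \mu^n$. In fact, $Y_1 = 1 + X_1$ and $\mathbb{E}Y_1 = 1 + \mu$.
Assume that the formula holds for $n \le k - 1$. Then

$$\mathbb{E}Y_k = \psi_k'(1) = \phi(\psi_{k-1}(1)) + \phi'(\psi_{k-1}(1))\,\psi_{k-1}'(1) = 1 + \mu(1 + \mu + \cdots + \mu^{k-1}) = 1 + \mu + \cdots + \mu^k,$$

which completes the induction. Therefore, $\mathbb{E}Y = (1 - \mu)^{-1}$.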


Worked Example 1.7.7 Green’s disease turns your hair pink and your 
toenails blue but has no other symptoms. A sufferer from Green’s disease has 
a probability $p_n$ of infecting $n$ further uninfected individuals ($n = 0, 1, 2, 3$)
but will not infect more than 3. (The standard assumptions of elementary
branching processes apply.) Write down an expression for $e$, the expected
number of individuals infected by a single sufferer.

Starting from the first principles, find the probability $\pi$ that the disease
will die out if, initially, there is only one case.

Let $e_A$ and $\pi_A$ be the values of $e$ and $\pi$ when $p_0 = 2/5$, $p_1 = p_2 = 0$ and
$p_3 = 3/5$. Let $e_B$ and $\pi_B$ be the values of $e$ and $\pi$ when $p_0 = p_1 = 1/10$,
$p_2 = 4/5$ and $p_3 = 0$. Show that $e_A > e_B$ but $\pi_A > \pi_B$.


Solution The expression for $e$ is $e = p_1 + 2p_2 + 3p_3$. Let $X_j$ be the number
of individuals in the $j$th generation of the disease and $X_0 = 1$. Assume that
each sufferer dies once passing the disease on to $n \le 3$ others. Call $\pi$ the
probability that the disease dies out:

$$\pi = \sum_k \mathbb{P}(X_1 = k)\pi^k = p_0 + p_1\pi + p_2\pi^2 + p_3\pi^3.$$

Direct calculation shows that $e_A = 9/5$, $e_B = 17/10$. Value $\pi_A$ is identified
as the smallest positive root of the equation

$$0 = p_0 + (p_1 - 1)\pi + p_2\pi^2 + p_3\pi^3 = (\pi - 1)\Big(\frac{3}{5}\pi^2 + \frac{3}{5}\pi - \frac{2}{5}\Big).$$

Hence

$$\pi_A = \frac{-3 + \sqrt{33}}{6} \approx 0.46.$$

Similarly, $\pi_B$ is the smallest positive root of the equation

$$0 = \frac{4}{5}(\pi - 1)\Big(\pi - \frac{1}{8}\Big),$$

and $\pi_B = 1/8$. So, $e_A > e_B$ and $\pi_A > \pi_B$.


Worked Example 1.7.8 Suppose that $(X_r, r \ge 0)$ is a branching process
with $X_0 = 1$ and that the PGF for $X_1$ is $\phi(s)$. Establish an iterative
formula for the PGF $\phi_r(s)$ for $X_r$. State a result in terms of $\phi(s)$ about the
probability of eventual extinction.

(i) Suppose the probability that an individual leaves $k$ descendants in the
next generation is $p_k = 1/2^{k+1}$, for $k \ge 0$. Show from the result you state
that extinction is certain. Prove further that

$$\phi_r(s) = \frac{r - (r - 1)s}{(r + 1) - rs}, \quad r \ge 1,$$

and deduce the probability that the $r$th generation is empty.

(ii) Suppose that every individual leaves at most three descendants in the
next generation, and that the probability of leaving $k$ descendants in the
next generation is

$$p_k = \binom{3}{k}\frac{1}{2^3}, \quad k = 0, 1, 2, 3.$$

What is the probability of extinction?


Solution (i) Let $Y_i^n$ be the number of offspring of individual $i$ in generation
$n$. Then

$$X_{n+1} = Y_1^n + \cdots + Y_{X_n}^n,$$

$$\begin{aligned}
\mathbb{E}(s^{X_{n+1}}) &= \mathbb{E}[\mathbb{E}(s^{X_{n+1}}|X_n)] = \sum_{k=0}^{\infty}\mathbb{P}(X_n = k)\,\mathbb{E}(s^{Y_1^n + \cdots + Y_k^n}) \\
&= \sum_{k=0}^{\infty}\mathbb{P}(X_n = k)\,(\mathbb{E}s^{Y_1^n})^k = \sum_{k=0}^{\infty}\mathbb{P}(X_n = k)\,\phi(s)^k = \phi_n(\phi(s)),
\end{aligned}$$

and so $\phi_{n+1}(s) = \phi_n(\phi(s))$. The probability of extinction is the least $s \in
(0, 1]$ such that $s = \phi(s)$. Further,

$$\phi(s) = \frac{1}{2} + \frac{1}{2^2}s + \frac{1}{2^3}s^2 + \cdots = \frac{1}{2 - s}.$$

Solving $s = 1/(2 - s)$ yields $(s - 1)^2 = 0$, i.e. $s = 1$. Hence the extinction is
certain.
The formula for $\phi_r(s)$ is established by induction:

$$\phi_{r+1}(s) = \frac{r - (r - 1)/(2 - s)}{r + 1 - r/(2 - s)} = \frac{(r + 1) - rs}{(r + 2) - (r + 1)s}.$$

Hence, the probability that the $r$th generation is empty is

$$\phi_r(0) = \frac{r}{r + 1}.$$

(ii) $\phi(s) = \frac{1}{2^3}(1 + s)^3$, whence solving $2^3 s = (1 + s)^3$ or $(s - 1)(s^2 + 4s - 1) = 0$
we get solutions $s = 1$, $s = \pm\sqrt{5} - 2$, and the extinction probability is
$\sqrt{5} - 2 \approx 0.24$.
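The closed form for $\phi_r(s)$ in part (i) can be confirmed exactly for small $r$ by composing $\phi(s) = 1/(2 - s)$ with itself in rational arithmetic. This is a minimal Python sketch; the function names `iterate` and `closed_form` are ours.

```python
from fractions import Fraction


def phi(s):
    """PGF of the offspring law p_k = 1/2**(k+1), i.e. phi(s) = 1/(2 - s)."""
    return 1 / (2 - s)


def iterate(s, r):
    """phi_r(s), the r-fold composition of phi with itself."""
    for _ in range(r):
        s = phi(s)
    return s


def closed_form(s, r):
    """The formula (r - (r - 1)s) / ((r + 1) - rs) from part (i)."""
    return (r - (r - 1) * s) / ((r + 1) - r * s)


for r in range(1, 6):
    s = Fraction(1, 3)   # an arbitrary rational test point in [0, 1)
    print(r, iterate(Fraction(0), r), iterate(s, r) == closed_form(s, r))
```

The middle column is $\phi_r(0) = r/(r + 1)$, which increases to 1 in line with certain extinction.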


Worked Example 1.7.9 Consider a Galton–Watson process (i.e. the
branching process where the number of offspring is random and independent
for each division), with $\mathbb{P}(X_1 = 0) = 2/5$ and $\mathbb{P}(X_1 = 2) = 3/5$.
Compute the distribution of the random variable $X_2$. For generation 3 find
all probabilities $\mathbb{P}(X_3 = 2k)$, $k = 0, 1, 2, 3, 4$. Find the extinction probability
for this model.


Solution The extinction probability $\pi = 2/3$ and the PGF for $X_2$

$$\phi_{X_2}(s) = \frac{2}{5} + \frac{12}{125} + \frac{36}{125}s^2 + \Big(\frac{3}{5}\Big)^3 s^4.$$

The PGF $\phi_{X_3}(s) = \phi_{X_2}(2/5 + 3s^2/5)$. Then

$$\mathbb{P}(X_2 = 0) = \frac{2}{5} + \frac{12}{125}, \quad \mathbb{P}(X_2 = 2) = \frac{36}{125}, \quad \mathbb{P}(X_2 = 4) = \Big(\frac{3}{5}\Big)^3;$$

$$\mathbb{P}(X_3 = 0) = \frac{2}{5} + \frac{12}{125} + \frac{36}{125}\Big(\frac{2}{5}\Big)^2 + \Big(\frac{3}{5}\Big)^3\Big(\frac{2}{5}\Big)^4 = 0.54761;$$

$$\mathbb{P}(X_3 = 2) = \frac{2^4 \times 3^3}{5^5} + \frac{2^5 \times 3^4}{5^7} = 0.17142;$$

$$\mathbb{P}(X_3 = 4) = \frac{2^2 \times 3^4}{5^5} + \frac{2^3 \times 3^6}{5^7} = 0.17833;$$

$$\mathbb{P}(X_3 = 6) = \frac{2^3 \times 3^6}{5^7} = 0.07465;$$

$$\mathbb{P}(X_3 = 8) = \Big(\frac{3}{5}\Big)^7 = 0.02799.$$
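These values follow mechanically from composing the PGF with itself, which is easy to do exactly on coefficient lists. The Python sketch below is one way to organise that computation; `poly_mul` and `compose` are hypothetical helper names of ours, and a polynomial is stored as the list of its coefficients in increasing powers of $s$.

```python
from fractions import Fraction


def poly_mul(a, b):
    """Product of two polynomials given as coefficient lists."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def compose(outer, inner):
    """Coefficients of outer(inner(s))."""
    result = [Fraction(0)]
    power = [Fraction(1)]                     # inner(s)**0
    for c in outer:
        term = [c * x for x in power]
        size = max(len(result), len(term))
        result = [(result[i] if i < len(result) else 0) + (term[i] if i < len(term) else 0)
                  for i in range(size)]
        power = poly_mul(power, inner)        # next power of inner(s)
    return result


phi1 = [Fraction(2, 5), Fraction(0), Fraction(3, 5)]   # P(X_1 = 0) = 2/5, P(X_1 = 2) = 3/5
phi2 = compose(phi1, phi1)                             # PGF of X_2
phi3 = compose(phi2, phi1)                             # PGF of X_3 = phi_{X_2}(phi_{X_1}(s))
for k in range(0, 9, 2):
    print(k, phi3[k], float(phi3[k]))
```

The even-indexed entries of `phi3` are the probabilities $\mathbb{P}(X_3 = 2k)$ listed above, and the odd-indexed ones vanish.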


Worked Example 1.7.10 No-one in their right mind would wish to be a
guest at the Virtual Reality Hotel. The rooms are numbered 0 to $(3^N - 3)/2$,
where $N$ is a very large integer. If $0 \le i \le (3^{N-1} - 3)/2$ and $j = 1, 2, 3$ there
is a door between Room $i$ and Room $3i + j$ through which (if it is unlocked)
guests may pass in both directions. In addition, any room with a number
higher than $(3^{N-1} - 3)/2$ has an open window through which guests can
(and should) escape into the street. So far as the guests are concerned, there
are no other doors or windows. Figure 1.9 shows part of the floor plan of
the hotel.


Figure 1.9

Each door in the hotel is locked with probability 1/3 independently of the
others. An arriving guest is placed in Room 0 and can then wander freely
(insofar as the locked doors allow). Show that the guest's chance of escape
is about $(9 - \sqrt{27})/4$.


Solution Denote by $X_r$ the number of rooms available at level $r$ from 0.
Writing $\phi_r(t) = \mathbb{E}t^{X_r}$, with $\phi_1 = \phi$:

$$\begin{aligned}
\phi_{r+1}(t) = \mathbb{E}t^{X_{r+1}} &= \sum_i \mathbb{P}(X_r = i)\,\mathbb{E}(t^{X_{r+1}}|X_r = i) \\
&= \sum_i \mathbb{P}(X_r = i)\,(\mathbb{E}t^{X_1})^i \\
&= \sum_i \mathbb{P}(X_r = i)\,(\phi(t))^i = \phi_r(\phi(t)).
\end{aligned}$$

Then $\phi_{r+1}(t) = \phi(\phi(\cdots\phi(t)\cdots)) = \phi(\phi_r(t))$, and

$$\mathbb{P}(\text{can't reach level } r) = \phi_r(0) = \phi(\phi_{r-1}(0)).$$

Now PGF $\phi(t)$ equals

$$\Big(\frac{1}{3}\Big)^3 + 3\,\frac{2}{3}\Big(\frac{1}{3}\Big)^2 t + 3\Big(\frac{2}{3}\Big)^2\frac{1}{3}\,t^2 + \Big(\frac{2}{3}\Big)^3 t^3 = \frac{1}{27}(1 + 6t + 12t^2 + 8t^3).$$

Hence, the equation $\phi(t) = t$ becomes

$$27t = 1 + 6t + 12t^2 + 8t^3, \quad\text{i.e.}\quad 1 - 21t + 12t^2 + 8t^3 = 0.$$

By factorising $1 - 21t + 12t^2 + 8t^3 = (t - 1)(8t^2 + 20t - 1)$, we find that the
roots are

$$t = 1, \quad \frac{-5 \pm \sqrt{27}}{4}.$$

The root between 0 and 1 is $(\sqrt{27} - 5)/4$. The sequence $\phi_n(0)$ is monotone
increasing and bounded above. Hence, it converges as $N \to \infty$:

$$\mathbb{P}(\text{no escape}) \to \frac{\sqrt{27} - 5}{4}.$$

Then

$$\mathbb{P}(\text{escape}) \to 1 - \frac{\sqrt{27} - 5}{4} = \frac{9 - \sqrt{27}}{4} \approx 0.950962.$$
Problem 1.7.11 (i) A mature individual produces offspring according to
the PGF $\phi(s)$. Suppose we start with a population of $k$ immature individuals,
each of which grows to maturity with probability $p$ and then reproduces,
independently of the other individuals. Find the PGF of the number of
(immature) individuals in the next generation.

(ii) Find the PGF of the number of mature individuals in the next generation,
given that there are $k$ mature individuals in the parent generation.

Hint: (i) Let $R$ be the number of immature descendants of an immature
individual. If $X^{(n)}$ is the number of immature individuals in generation $n$,
then, given that $X^{(n)} = k$,

$$X^{(n+1)} = R_1 + \cdots + R_k,$$

where $R_i \sim R$, independently. The conditional PGF is

$$\mathbb{E}\big(s^{X^{(n+1)}}\big|X^{(n)} = k\big) = (g[\phi(s)])^k,$$

where $g(x) = 1 - p + px$.
(ii) Let $U$ be the number of mature descendants of a mature individual,
and $Z^{(n)}$ be the number of mature individuals in generation $n$. Then, as
before, conditional on $Z^{(n)} = k$,

$$Z^{(n+1)} = U_1 + \cdots + U_k,$$

where $U_i \sim U$, independently. The conditional PGF is

$$\mathbb{E}\big(s^{Z^{(n+1)}}\big|Z^{(n)} = k\big) = (\phi[g(s)])^k.$$


Problem 1.7.12 Show that the distributions in parts (i) and (ii) of the
previous problem have the same mean, but not necessarily the same variance.


Hint:

$$\frac{d}{ds}\,\mathbb{E}\big(s^{X^{(n+1)}}\big|X^{(n)} = 1\big)\Big|_{s=1} = p\phi'(1),$$

$$\frac{d^2}{ds^2}\,\mathbb{E}\big(s^{Z^{(n+1)}}\big|Z^{(n)} = 1\big)\Big|_{s=1} = p^2\phi''(1).$$


2 
2.1 Uniform distribution. Probability density functions.
Random variables. Independence


Probabilists do it continuously but discreetly. 
(From the series ‘How they do it’.) 


Bye, Bi, Variate. 
(From the series ‘Movies that never made it to the Big Screen’.) 


After developing a background in probabilistic models with discrete out- 
comes we can now progress further and do exercises where uncountably 
many outcomes are explicitly involved. Here, the events are associated with 
subsets of a continuous space (a real line $\mathbb{R}$, an interval $(a, b)$, a plane $\mathbb{R}^2$, a
square, etc.). The simplest case is where the outcome space $\Omega$ is represented
by a 'nice' bounded set and the probability distribution corresponds to a
unit mass uniformly spread over it. Then an event (i.e. a subset) $A \subseteq \Omega$
acquires the probability

$$\mathbb{P}(A) = \frac{v(A)}{v(\Omega)}, \qquad (2.1.1)$$

where $v(A)$ is the standard Euclidean volume (or area or length) of $A$ and
$v(\Omega)$ that of $\Omega$.

The term ‘uniformly spread’ is the key here; an example below shows that 
one has to be careful about what exactly it means in the given context. 


Example 2.1.1 This is known as Bertrand’s Paradox. A chord has been 
chosen at random in a circle of radius r. What is the probability that it is 
longer than the side of the equilateral triangle inscribed in the circle? The
answer is different in the following three cases:


(i) the middle point of the chord is distributed uniformly inside the circle;
(ii) one endpoint is fixed and the second is uniformly distributed over the
circumference; and
(iii) the distance between the middle point of the chord and the centre of
the circle is uniformly distributed over the interval $[0, r]$.


In fact, in case (i), the middle point of the chord must lie inside the circle
inscribed in the triangle. Hence,

$$\mathbb{P}(\text{chord longer}) = \frac{\text{area of the inscribed circle}}{\text{area of the original circle}} = \frac{\pi r^2/4}{\pi r^2} = \frac{1}{4}.$$

In case (ii), the second endpoint must then lie on the opposite third of the
circumference. Hence,

$$\mathbb{P}(\text{chord longer}) = \frac{1}{3}.$$

Finally, in case (iii), the middle point of the chord must be at distance $\le r/2$
from the centre. Hence,

$$\mathbb{P}(\text{chord longer}) = \frac{1}{2}.$$
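The three answers can be reproduced by simulation, which makes it plain that each corresponds to a different sampling scheme. The Python sketch below is an illustration only; the function name `chord_probabilities`, the number of trials and the seed are arbitrary choices of ours.

```python
import math
import random


def chord_probabilities(trials=200_000, seed=0, r=1.0):
    """Estimate P(chord longer than the triangle side) under schemes (i), (ii), (iii)."""
    rng = random.Random(seed)
    side = math.sqrt(3) * r                    # side of the inscribed equilateral triangle
    hits = [0, 0, 0]
    for _ in range(trials):
        # (i) midpoint uniform inside the circle: rejection sampling from the bounding square
        while True:
            x, y = rng.uniform(-r, r), rng.uniform(-r, r)
            d2 = x * x + y * y
            if d2 <= r * r:
                break
        hits[0] += 2 * math.sqrt(r * r - d2) > side
        # (ii) one endpoint fixed, the other at a uniform angle on the circumference
        theta = rng.uniform(0, 2 * math.pi)
        hits[1] += 2 * r * math.sin(theta / 2) > side
        # (iii) distance from the centre to the midpoint uniform on [0, r]
        d = rng.uniform(0, r)
        hits[2] += 2 * math.sqrt(r * r - d * d) > side
    return [h / trials for h in hits]


print(chord_probabilities())
```

With a sample of this size the three estimates should fall close to 1/4, 1/3 and 1/2 respectively.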


A useful observation is that we can think in terms of a uniform probability
density function assigning to a point $x \in \Omega$ the value

$$f_\Omega^{\mathrm{uni}}(x) = \frac{1}{v(\Omega)}, \qquad (2.1.2)$$

with the probability of event $A \subseteq \Omega$ calculated as the integral

$$\mathbb{P}(A) = \int_A f_\Omega^{\mathrm{uni}}(x)\,dx = \frac{1}{v(\Omega)}\int_A dx, \qquad (2.1.3)$$

giving of course the same answer as formula (2.1.1). Because $f_\Omega^{\mathrm{uni}} \ge 0$ and
$\int_\Omega f_\Omega^{\mathrm{uni}}(x)\,dx = 1$, the probability of event $A \subseteq \Omega$ is always between 0 and
1. Note that the mass assigned to a single outcome $\omega$ represented by a
point of $\Omega$ is zero. Hence the mass assigned to any finite or countable set of
outcomes is zero (as it is the sum of the masses assigned to each outcome);
to get a positive mass (and thus a positive probability), an event $A$ must be
uncountable.


Worked Example 2.1.2. Alice and Bob agree to meet in the Copper Ket- 
tle after their Saturday lectures. They arrive at times that are independent 
and uniformly distributed between 12:00 and 13:00. Each is prepared to wait
$s$ minutes before leaving. Find a minimal $s$ such that the probability that
they meet is at least 1/2.


Figure 2.1


Solution The set $\Omega$ is the unit square $S$ with co-ordinates $0 \le x, y \le 1$
(measuring the time in fractions of an hour between 12:00 and 13:00). Each
$\omega = (x, y) \in \Omega$ specifies Alice's arrival time $x$ and Bob's $y$. Then the event:
'they arrive within $s$ minutes of each other' is a strip around the diagonal $x = y$:

$$A = \Big\{(x, y) \in S : |x - y| \le \frac{s}{60}\Big\}.$$

Its complement is formed by two triangles, of area $(1/2)(1 - s/60)^2$ each.
So, the area $v(A)$ is

$$1 - \Big(1 - \frac{s}{60}\Big)^2,$$

and we want it to be $\ge 1/2$. See Figure 2.1.
This gives $s \ge 60(1 - \sqrt{2}/2) \approx 18$ minutes.

Worked Example 2.1.3 A stick is broken into two at random; then the 
longer half is broken again into two pieces at random. What is the probability 
that the three pieces will make a triangle? 


Solution Let the stick length be $\ell$. If $x$ is the place of the first break, then
$0 \le x \le \ell$ and $x$ is uniform on $(0, \ell)$. If $x \ge \ell/2$, then the second break point
$y$ is uniformly chosen on the interval $(0, x)$. See Figure 2.2. Otherwise $y$ is
uniformly chosen on $(x, \ell)$.

Thus

$$\Omega = \{(x, y) : 0 \le x, y \le \ell;\ y \le x \text{ for } x \ge \ell/2 \text{ and } x \le y \le \ell \text{ for } x \le \ell/2\}.$$

See Figure 2.3.

Figure 2.2


Figure 2.3


To make a triangle $(x, y)$ must lie in $A$, where

$$A = \Big\{\max[x, y] > \frac{\ell}{2},\ \min[x, y] < \frac{\ell}{2},\ |x - y| < \frac{\ell}{2}\Big\}.$$

Owing to the symmetry,

$$\mathbb{P}(\text{triangle}) = 2\int_0^{\ell/2}\frac{1}{\ell}\,\frac{1}{\ell - x}\int_{\ell/2}^{\ell/2 + x} dy\,dx = \frac{2}{\ell}\int_0^{\ell/2}\frac{x}{\ell - x}\,dx,$$

which leads to the answer $2\ln 2 - 1 = 0.3862944$.
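The answer can be checked by simulating the two-stage breaking directly on a stick of unit length. The following Python sketch is only an illustration; the function name `triangle_probability`, the number of trials and the seed are our own choices.

```python
import math
import random


def triangle_probability(trials=200_000, seed=1):
    """Estimate the chance that the three pieces form a triangle (stick of length 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.random()                       # first break point, uniform on (0, 1)
        longer = max(x, 1 - x)
        y = rng.uniform(0, longer)             # second break, uniform along the longer piece
        pieces = (min(x, 1 - x), y, longer - y)
        # Three pieces of total length 1 form a triangle iff each is shorter than 1/2.
        hits += max(pieces) < 0.5
    return hits / trials


print(triangle_probability(), 2 * math.log(2) - 1)
```

The simulated frequency should be close to $2\ln 2 - 1$.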


Worked Example 2.1.4 A stick is broken in two places chosen before- 
hand, completely at random along its length. What is the probability that 


the three pieces will make a triangle? 


Solution The answer is 1/4. This value is less than the previous one. One
difference between the two is that the whole sample space $\Omega$ in the last
problem is the square $(0,\ell) \times (0,\ell)$, i.e. is larger than in the former (as
the shorter of the two halves can be broken, which is impossible in the
first problem). However, the event of interest is given by the same set $A$.
Intuitively, this should lead to a smaller probability. This is indeed the case,
despite the fact that the probability mass in the two problems is spread
differently.
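
Both stick-breaking answers can be compared by simulation. The sketch below is an illustration only: it assumes a stick of unit length, and the helper names (`makes_triangle`, `longer_piece_rebroken`, `two_independent_breaks`, `estimate`) are hypothetical.

```python
import math
import random

def makes_triangle(a, b, c):
    """Three lengths form a triangle iff each is less than half the total."""
    return max(a, b, c) < (a + b + c) / 2.0

def longer_piece_rebroken(rng):
    """Worked Example 2.1.3: break once, then break the longer piece again."""
    x = rng.random()                      # first break, uniform on (0, 1)
    if x >= 0.5:
        y = rng.uniform(0.0, x)           # second break on (0, x)
        return makes_triangle(y, x - y, 1.0 - x)
    y = rng.uniform(x, 1.0)               # second break on (x, 1)
    return makes_triangle(x, y - x, 1.0 - y)

def two_independent_breaks(rng):
    """Worked Example 2.1.4: two break points chosen independently."""
    x, y = sorted((rng.random(), rng.random()))
    return makes_triangle(x, y - x, 1.0 - y)

def estimate(trial, n=200_000, seed=1):
    rng = random.Random(seed)
    return sum(trial(rng) for _ in range(n)) / n

p_rebroken = estimate(longer_piece_rebroken)      # compare with 2 ln 2 - 1
p_independent = estimate(two_independent_breaks)  # compare with 1/4
exact_rebroken = 2.0 * math.log(2.0) - 1.0
```

The estimates should agree with $2\ln 2 - 1$ and $1/4$ up to Monte Carlo error of order $10^{-3}$.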


Worked Example 2.1.5 The ecu coin is a disc of diameter 4/5 units. In 
the traditional game of drop the ecu, an ecu coin is dropped at random onto 
a grid of squares formed by lines one unit apart. If the coin covers part of 
a line you lose your ecu. If not, you still lose your ecu but the band plays 
the national anthem of your choice. What is the probability that you get to 
choose your national anthem? 


Solution Without loss of generality, assume that the centre of the coin lies
in the unit square with corners (0;0), (0;1), (1;0), (1;1). You will hear an
anthem when the centre lies in the inside square $S$ described by

$$\frac{2}{5} < x < \frac{3}{5}, \qquad \frac{2}{5} < y < \frac{3}{5}.$$

Hence,

$$P(\text{anthem}) = \text{area of } S = \frac{1}{25}.$$


There are several serious questions arising here which we will address 
later. One is the so-called measurability: there exist weird sets $A \subset \Omega$ (even
when $\Omega$ is a unit interval (0,1)) which do not have a correctly defined volume
(or length). In general, how does one measure the volume of a set $A$
in a continuous space? Such sets may not be particularly difficult to describe
(for instance, Cantor’s continuum $K$ has a correctly defined length),
but calculating their volume, area or length goes beyond standard Riemann 
integration, let alone elementary formulas. (As a matter of fact, the correct 
length of K is zero.) To develop a complete theory, we would need the so- 
called Lebesgue integration, which is called after H.L. Lebesgue (1875-1941), 
the famous French mathematician. Lebesgue was of a very humble origin, 
but became a top mathematician. He was renowned for flawless and elegant 
presentations and written works. In turn, the Lebesgue integration requires 
the concept of a sigma-algebra and an additive measure which leads to a 
far-reaching generalisation of the concept of length, area and volume en- 
capsulated in a concept of a measure. We will discuss these issues in later 
volumes. 

An issue to discuss now is: what if the distribution of the mass is not
uniform? This question is not only of a purely theoretical interest. In many
practical models $\Omega$ is represented by an unbounded subset of a Euclidean
space $\mathbb{R}^d$ whose volume is infinite (e.g. by $\mathbb{R}^d$ itself or by $\mathbb{R}_+ = [0,\infty)$, for
$d = 1$). Then the denominator $v(\Omega)$ in equation (2.1.1) becomes infinite.
Here, the recipe is: consider a function $f \ge 0$ with $\int_\Omega f(x)\,dx = 1$ and set

$$P(A) = \int_A f(x)\,dx, \quad A \subset \Omega$$

(see equation (2.1.3)). Such a function $f$ is interpreted as a (general) probability
density function (PDF). The following natural (and important) examples
appear in problems below.

A uniform distribution on an interval $(a,b)$, $a < b$: here $\Omega = (a,b)$ and

$$f(x) = \frac{1}{b-a}\, I(a < x < b). \tag{2.1.4}$$

A Gaussian, or normal, distribution: here $\Omega = \mathbb{R}$ and

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{1}{2\sigma^2}(x-\mu)^2\Big), \quad x \in \mathbb{R}. \tag{2.1.5}$$

Here $\mu \in \mathbb{R}$, $\sigma > 0$ are parameters specifying the distribution.

The graphs of normal PDFs on an interval around the origin and away
from it are plotted in Figures 2.4 and 2.5.

This is the famous curve about which the great French mathematician J.H.
Poincaré (1854-1912) said ‘Experimentalists think that it is a mathematical
theorem while the mathematicians believe it to be an experimental fact’.

An exponential distribution: here $\Omega = \mathbb{R}_+$ and

$$f(x) = \lambda e^{-\lambda x}\, I(x \ge 0), \quad x \in \mathbb{R}. \tag{2.1.6}$$

Here $\lambda > 0$ is a parameter specifying the distribution.


Figure 2.4 The normal PDFs, I.
Figure 2.5 The normal PDFs, II.
Figure 2.6 The exponential PDFs.


The graphs of the exponential PDFs are shown in Figure 2.6. 

A generalisation of formula (2.1.6) is the Gamma distribution. Here again
$\Omega = \mathbb{R}_+$ and the PDF is

$$f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}\, I(x \ge 0), \tag{2.1.7}$$

with parameters $\alpha, \lambda > 0$. Here $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx$ (the value of the
Gamma function) is the normalising constant. Recall, for a positive integer
argument, $\Gamma(n) = (n-1)!$; in general, for $\alpha > 1$: $\Gamma(\alpha) = (\alpha-1)\Gamma(\alpha-1)$.

The Gamma distribution plays a prominent role in statistics and will
repeatedly appear in later chapters. The graphs of the Gamma PDFs are
sketched in Figure 2.7.
Figure 2.8 The Cauchy PDFs.


Another example is a Cauchy distribution, with $\Omega = \mathbb{R}$ and

$$f(x) = \frac{1}{\pi}\,\frac{\tau}{\tau^2 + (x-\alpha)^2}, \quad x \in \mathbb{R}, \tag{2.1.8}$$

with parameters $\alpha \in \mathbb{R}$ and $\tau > 0$. There is a story that the Cauchy distribution
was discovered by Poisson in 1824 when he proposed a counterexample
to the Central Limit Theorem (CLT); see below. The graphs of the Cauchy
PDFs are sketched in Figure 2.8.
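
As a sanity check on formulas (2.1.5)–(2.1.8), one can verify numerically that each density integrates to (approximately) 1. The sketch below is illustrative only: the midpoint rule, the truncation ranges and the parameter values are assumptions chosen for convenience, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def exponential_pdf(x, lam):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def gamma_pdf(x, alpha, lam):
    if x <= 0:
        return 0.0
    return lam ** alpha * x ** (alpha - 1.0) * math.exp(-lam * x) / math.gamma(alpha)

def cauchy_pdf(x, alpha, tau):
    return tau / (math.pi * (tau ** 2 + (x - alpha) ** 2))

def midpoint_integral(f, lo, hi, n=200_000):
    """Midpoint-rule approximation of the integral of f over (lo, hi)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

total_normal = midpoint_integral(lambda x: normal_pdf(x, 1.0, 2.0), -30.0, 30.0)
total_exponential = midpoint_integral(lambda x: exponential_pdf(x, 0.5), 0.0, 80.0)
total_gamma = midpoint_integral(lambda x: gamma_pdf(x, 2.5, 1.5), 0.0, 60.0)
# Cauchy tails are heavy: the mass of (-R, R) is (2/pi) * atan(R / tau), visibly below 1.
mass_cauchy = midpoint_integral(lambda x: cauchy_pdf(x, 0.0, 1.0), -1000.0, 1000.0)
```

The Cauchy case shows why a finite truncation range is a poor substitute for the whole line when tails decay only polynomially.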

Cauchy was a staunch royalist and a devoted Catholic and, unlike many 
other prominent French scientists of the period, he had difficult relations 
with the Republican regime. In 1830, during one of the nineteenth-century 
French revolutions, he went into voluntary exile in Turin and Prague where
he gave private mathematics lessons to the children of the Bourbon royal
family. His admission to the French Academy occurred only in 1838, after 
he had returned to Paris. 

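The Gaussian distribution will be discussed in detail below. At this stage
we only indicate a generalisation to the multidimensional case where
$\Omega = \mathbb{R}^d$ and

$$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^d (\det \Sigma)^{1/2}} \exp\Big[-\frac{1}{2}\big\langle \mathbf{x} - \boldsymbol{\mu},\ \Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\big\rangle\Big]. \tag{2.1.9}$$

Here, $\mathbf{x}$ and $\boldsymbol{\mu}$ are real $d$-dimensional vectors:

$$\mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_d \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_d \end{pmatrix} \in \mathbb{R}^d,$$

and $\Sigma$ is an invertible positive-definite $d \times d$ real matrix, with the determinant
$\det \Sigma > 0$ and the inverse matrix $\Sigma^{-1} = (\Sigma^{-1}_{ij})$. The matrix $\Sigma$ is called
positive-definite if it can be represented as the product $\Sigma = AA^*$ and strictly
positive-definite if in this representation the matrix $A$ is invertible, i.e. the
inverse $A^{-1}$ exists (in which case the inverse matrix $\Sigma^{-1} = A^{*-1}A^{-1}$ also
exists). It is easy to see that a positive-definite matrix $\Sigma$ is always symmetric
(or Hermitian), i.e. obeys $\Sigma^* = \Sigma$. Hence, a positive-definite matrix has an
orthonormal basis of eigenvectors, and its eigenvalues are non-negative (positive
if it is strictly positive-definite). Further, $\langle\ ,\ \rangle$ stands for the Euclidean
scalar product in $\mathbb{R}^d$:

$$\big\langle \mathbf{x} - \boldsymbol{\mu},\ \Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\big\rangle = \sum_{i,j=1}^{d} (x_i - \mu_i)\, \Sigma^{-1}_{ij}\, (x_j - \mu_j).$$

A PDF of this form is called a multivariate normal or multivariate Gaussian
distribution.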


Remark 2.1.6 We already have seen a number of probability distribu- 
tions bearing personal names: Gauss, Poisson, Cauchy; more will appear in 
Chapters 3 and 4. Another example (though not as frequent) is the Simpson 
distribution. Here we take X, Y ~ U(0,1), independently. Then X + Y has 
a ‘triangular’ PDF known as Simpson’s PDF: 


$$f_{X+Y}(u) = \begin{cases} u, & 0 \le u \le 1, \\ 2 - u, & 1 \le u \le 2, \\ 0, & \text{otherwise}. \end{cases}$$

T. Simpson (1710-1761), an English scientist, left a notable mark in interpolation
and numerical methods of integration. He was the most distinguished
of the group of itinerant lecturers who taught in fashionable
London coffee-houses (a popular way of spreading scientific information in 
eighteenth-century England). 


As before, we face a question: what type of function $f$ can serve as PDF?
(The example with $f(x) = I(x \in (0,1) \setminus K)$, where $K \subset (0,1)$ is Cantor’s set,
is typical. Here $f \ge 0$ by definition, but how should $\int_0^1 f(x)\,dx$ be defined?)
And again, the answer lies in the theory of Lebesgue integration. Fortunately,
in ‘realistic’ models, these matters arise rarely and are overshadowed by far
more practical issues.

So, from now until the end of this chapter our basic model will be where
outcomes $\omega$ run over an ‘allowed’ subset of $\Omega$ (such subsets are called measurable
and will be introduced later). Quite often $\Omega$ will be $\mathbb{R}^d$. The probability
$P(A)$ will be calculated for every such set $A$ (called an event in $\Omega$) as

$$P(A) = \int_A f(x)\,dx. \tag{2.1.10}$$

Here $f$ is a given PDF: $f \ge 0$ with $\int_\Omega f(x)\,dx = 1$.


As in the discrete case, we have an intuitively plausible property of additivity:
if $A_1, A_2, \ldots$ is a (finite or countable) sequence of pair-wise disjoint
events, then

$$P\Big(\bigcup_j A_j\Big) = \sum_j P(A_j), \tag{2.1.11}$$

while, in a general case, $P\big(\bigcup_j A_j\big) \le \sum_j P(A_j)$. As $P(\Omega) = 1$, we obtain
that for the complement $A^c = \Omega \setminus A$, $P(A^c) = 1 - P(A)$, and for the set-theoretical
difference $A \setminus B$, $P(A \setminus B) = P(A) - P(A \cap B)$. Of course, more
advanced facts that we learned in the discrete-space case remain true, such
as the inclusion–exclusion formula.

In this setting, the concept of a random variable (RV) develops, unsurprisingly,
from its discrete-outcome analogue: a RV is a function

$$X:\ \omega \in \Omega \mapsto X(\omega),$$

with real or complex values $X(\omega)$ (in the complex case we again consider a
pair of real RVs representing the real and imaginary parts). Formally, a real
RV must have the property that, for all $y \in \mathbb{R}$, the set $\{\omega \in \Omega:\ X(\omega) < y\}$ is
an event in $\Omega$ to which the probability $P(X < y)$ can be assigned. Then with
each real-valued RV we associate its cumulative distribution function (CDF)

$$y \in \mathbb{R} \mapsto F_X(y) = P(X < y) \tag{2.1.12}$$

varying monotonically from 0 to 1 as $y$ increases from $-\infty$ to $\infty$. See
Figure 2.9.

Figure 2.9
The quantity

$$\bar F_X(x) = 1 - F_X(x) = P(X \ge x) \tag{2.1.13}$$

describing tail probabilities is also often used.

Observe that definition (2.1.12) leads to a CDF $F_X(x)$ that is left-continuous
(in Figure 2.9 it is represented by black dots). It means that
$F_X(x_n) \nearrow F_X(x)$ whenever $x_n \nearrow x$. On the other hand, for all $x$, the right-hand
limit $\lim_{x_n \searrow x} F_X(x_n)$ also exists and is $\ge F_X(x)$, but the equality is
not guaranteed (in the figure this is represented by circles). Of course the
tail probability $\bar F_X(x)$ is again left-continuous.

However, if we adopt the definition that $F_X(x) = P(X \le x)$ (which is the
case in some textbooks) then $F_X$ will become right-continuous (as well as
the tail probability $\bar F_X$).


Example 2.1.7 If $X \equiv b$ is a (real) constant, the CDF $F_X$ is the Heaviside-type
function

$$F_X(y) = I(y > b). \tag{2.1.14}$$

If $X = I_A$, the indicator function of an event $A$, then $P(X < y)$ equals
0 for $y \le 0$, $1 - P(A)$ for $0 < y \le 1$, and 1 for $y > 1$. More generally,
if $X$ admits a discrete set of (real) values (i.e. finitely or countably many,
without accumulation points on $\mathbb{R}$), say $y_j \in \mathbb{R}$, with $y_j < y_{j+1}$, then $F_X(y)$
is constant on each interval $y_j < y \le y_{j+1}$ and has jumps at points $y_j$ of
size $P(X = y_j)$.
Figure 2.10 The Poisson CDFs.


Observe that all previously discussed discrete-outcome examples of RVs
may be fitted into this framework. For instance, if $X \sim \mathrm{Bin}(n, p)$, then

$$F_X(y) = \sum_{0 \le m < y,\ m \le n} \binom{n}{m} p^m (1-p)^{n-m}\, I(y > 0). \tag{2.1.15}$$

If $X \sim \mathrm{Po}(\lambda)$,

$$F_X(y) = e^{-\lambda} \sum_{0 \le n < y} \frac{\lambda^n}{n!}\, I(y > 0). \tag{2.1.16}$$

Figure 2.10 shows the graphs of $F_X$ for $X \sim \mathrm{Po}(\lambda)$.
If $X \sim \mathrm{Geom}(q)$,

$$F_X(y) = I(y > 0)\,(1-q) \sum_{0 \le n < y} q^n. \tag{2.1.17}$$

The graph of the CDF of RV $X \sim \mathrm{Geom}(q)$ is plotted in Figure 2.11, together
with that of a Poisson RV with $\lambda = 1$ (both RVs have the same mean
value 1).
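
The strict inequality in (2.1.12) matters at the jump points of a discrete CDF. The following sketch evaluates (2.1.16) under that convention; the function name `poisson_cdf_strict` is hypothetical and used only for illustration.

```python
import math

def poisson_cdf_strict(y, lam):
    """F_X(y) = P(X < y) for X ~ Po(lam): sum over integers n with 0 <= n < y."""
    if y <= 0:
        return 0.0
    n_max = math.ceil(y) - 1          # largest integer strictly below y
    return math.exp(-lam) * sum(lam ** n / math.factorial(n) for n in range(n_max + 1))

at_jump = poisson_cdf_strict(2.0, 1.0)            # P(X < 2) = P(X = 0) + P(X = 1)
just_after = poisson_cdf_strict(2.0 + 1e-9, 1.0)  # the value X = 2 is now included
jump_size = just_after - at_jump                  # equals P(X = 2)
```

With this convention the CDF is left-continuous: its value at the point $y = 2$ excludes the atom at 2, which is picked up immediately to the right.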

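Figure 2.11 The geometric and Poisson CDFs.

We say that an RV $X$ has a uniform, Gaussian, exponential, Gamma or
Cauchy distribution (with the corresponding parameters) if its CDF $F_X(y)$
is prescribed by the corresponding PDF, i.e. $P(X < y) = \int f(x)\, I(x < y)\,dx$.
For example, for a uniform RV,

$$F_X(y) = \begin{cases} 0, & y \le a, \\ (y-a)/(b-a), & a < y < b, \\ 1, & y \ge b; \end{cases} \tag{2.1.18}$$

for a Gaussian,

$$F_X(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{y} \exp\Big[-\frac{1}{2\sigma^2}(x-\mu)^2\Big]\,dx = \Phi\Big(\frac{y-\mu}{\sigma}\Big), \quad y \in \mathbb{R}; \tag{2.1.19}$$

for an exponential RV,

$$F_X(y) = \begin{cases} 0, & y \le 0, \\ 1 - e^{-\lambda y}, & y > 0; \end{cases} \tag{2.1.20}$$

for a Gamma RV

$$F_X(y) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \int_0^y x^{\alpha-1} e^{-\lambda x}\,dx\ I(y > 0); \tag{2.1.21}$$

and for a Cauchy RV,

$$F_X(y) = \frac{1}{\pi}\Big[\tan^{-1}\Big(\frac{y-\alpha}{\tau}\Big) + \frac{\pi}{2}\Big], \quad y \in \mathbb{R}. \tag{2.1.22}$$

Correspondingly, we write $X \sim \mathrm{U}(a,b)$, $X \sim \mathrm{N}(\mu,\sigma^2)$, $X \sim \mathrm{Exp}(\lambda)$,
$X \sim \mathrm{Gam}(\alpha, \lambda)$ and $X \sim \mathrm{Ca}(\alpha, \tau)$.

In Figures 2.12–2.15 we show some graphics for these CDFs.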

In general, we say that $X$ has a PDF $f$ (and write $X \sim f$) if, for all $y \in \mathbb{R}$,

$$P(X < y) = \int_{-\infty}^{y} f(x)\,dx. \tag{2.1.23}$$

Then, for all $a, b \in \mathbb{R}$ with $a < b$,

$$P(a < X < b) = \int_a^b f(x)\,dx, \tag{2.1.24}$$

and in general, for all measurable sets $A \subset \mathbb{R}$, $P(X \in A) = \int_A f(x)\,dx$.

Figure 2.13 The exponential CDFs.

Note that, in all calculations involving PDFs, the sets $C$ with $\int_C dx = 0$
(sets of measure 0) can be disregarded. Therefore, probabilities $P(a \le X \le b)$
and $P(a < X < b)$ coincide. (This is, of course, not true for discrete RVs.)

The median $m(X)$ of RV $X$ gives the value that ‘divides’ the range of $X$
into two pieces of equal mass. In terms of the CDF and PDF,

$$m(X) = \max\Big[y:\ \bar F_X(y) \ge \frac{1}{2}\Big] = \max\Big[y:\ \int_y^{\infty} f_X(x)\,dx \ge \frac{1}{2}\Big]. \tag{2.1.25}$$

If $F_X$ is strictly monotone and continuous, then, obviously, $m(X)$ equals the
unique value $y$ for which $F_X(y) = 1/2$. In other words, $m(X)$ is the unique
$y$ for which

$$\int_{-\infty}^{y} f_X(x)\,dx = \int_y^{\infty} f_X(x)\,dx.$$

The mode of an RV $X$ with a bounded PDF $f_X$ is the value $x$ where $f_X$
attains its maximum; sometimes one refers to local maxima as local modes.
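
When $F_X$ is continuous and strictly increasing, the median can be located by bisection on the equation $F_X(y) = 1/2$. The sketch below is an illustration under that assumption; `median_by_bisection` is a hypothetical helper, and the bracketing intervals and parameter values are arbitrary choices.

```python
import math

def median_by_bisection(cdf, lo, hi, tol=1e-12):
    """Solve cdf(y) = 1/2 for a continuous, strictly increasing cdf on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = 2.0
# Exponential CDF (2.1.20); the median is ln 2 / lam.
exp_median = median_by_bisection(lambda y: 1.0 - math.exp(-lam * y), 0.0, 100.0)

# Cauchy CDF (2.1.22) with alpha = 3, tau = 0.5; the median is alpha.
cauchy_median = median_by_bisection(
    lambda y: (math.atan((y - 3.0) / 0.5) + math.pi / 2.0) / math.pi, -1e6, 1e6)
```

For the Cauchy distribution the median coincides with the location parameter $\alpha$, which is also its mode.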


Worked Example 2.1.8 My Mum and I plan to take part in a televised
National IQ Test where we will answer a large number of questions, together
with selected groups of mathematics professors, fashion hairdressers,
brass band players and others, representing various sections of society (not
forgetting celebrities of the day of course). The IQ index, we were told, is
calculated differently for different ages. For me, it is equal to

$$-80\ln(1 - x),$$

where $x$ is the fraction of my correct answers, which can be anything between
0 and 1. In my Mum’s case, the IQ index is given by a different formula

$$-70\ln\frac{3/4 - y}{3/4} = -70\ln(3/4 - y) + 70\ln(3/4),$$

where $y$ is her fraction of correct answers. (In her age group one does not
expect it to exceed 3/4 (sorry, Mum!).)

We each aim to obtain at least 110. What is the probability that we will 
do this? What is the probability that my IQ will be better than hers? 


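Solution Again, we employ the uniform distribution assumption. The outcome
$\omega = (x_1, x_2)$ is uniformly spread in the set $\Omega$, which is the rectangle
$(0,1) \times (0,3/4)$, of area 3/4. We have a pair of RVs:

$$X(\omega) = -80\ln(1 - x_1), \quad \text{for my IQ},$$

$$Y(\omega) = -70\ln\frac{3/4 - x_2}{3/4}, \quad \text{for my Mum’s IQ},$$

and for all $y > 0$:

$$P(X < y) = \frac{1}{3/4}\int_0^1\!\!\int_0^{3/4} I\Big(\ln(1 - x_1) > -\frac{y}{80}\Big)\,dx_2\,dx_1 = \int_0^{1 - e^{-y/80}} dx_1 = 1 - e^{-y/80}, \ \text{i.e. } X \sim \mathrm{Exp}(1/80),$$

$$P(Y < y) = \frac{1}{3/4}\int_0^1\!\!\int_0^{3/4} I\Big(\ln\frac{3/4 - x_2}{3/4} > -\frac{y}{70}\Big)\,dx_2\,dx_1 = \frac{1}{3/4}\int_0^{3(1 - e^{-y/70})/4} dx_2 = 1 - e^{-y/70}, \ \text{i.e. } Y \sim \mathrm{Exp}(1/70).$$

Next, $P(\min[X,Y] < y) = 1 - P(\min[X,Y] \ge y)$. In turn, the probability
$P(\min[X,Y] \ge y)$ equals

$$\frac{1}{3/4}\int_0^1\!\!\int_0^{3/4} I\Big(-\ln(1 - x_1) \ge \frac{y}{80},\ -\ln\frac{3/4 - x_2}{3/4} \ge \frac{y}{70}\Big)\,dx_2\,dx_1 = e^{-y/80}\, e^{-y/70} = e^{-3y/112}.$$

Hence, $\min[X,Y] \sim \mathrm{Exp}(3/112)$. Therefore,

$$P(\text{both reach } 110) = e^{-3 \times 110/112} \approx e^{-3}$$

(pretty small). To increase the probability, we have to work hard to change
the underlying uniform distribution by something more biased towards
higher values of $x_1$ and $x_2$, the fractions of correct answers.

To calculate $P(X > Y)$, it is advisable to use the Jacobian

$$\frac{\partial(x_1, x_2)}{\partial(u_1, u_2)}$$

of the ‘inverse’ change of variables $x_1 = 1 - e^{-u_1/80}$, $x_2 = 3(1 - e^{-u_2/70})/4$
(the ‘direct’ change is $u_1 = -80\ln(1 - x_1)$, $u_2 = -70\ln[(3/4 - x_2)/(3/4)]$). Indeed,

$$\frac{\partial(x_1, x_2)}{\partial(u_1, u_2)} = \frac{1}{80}\, e^{-u_1/80} \cdot \frac{3}{4} \cdot \frac{1}{70}\, e^{-u_2/70}, \quad u_1, u_2 > 0,$$

and

$$P(X > Y) = \frac{1}{3/4}\int_0^1\!\!\int_0^{3/4} I\Big(-80\ln(1 - x_1) > -70\ln\frac{3/4 - x_2}{3/4}\Big)\,dx_2\,dx_1$$

$$= \frac{1}{3/4}\int_0^\infty\!\!\int_0^\infty \frac{3}{4}\,\frac{1}{80}\, e^{-u_1/80}\,\frac{1}{70}\, e^{-u_2/70}\, I(u_1 > u_2)\,du_2\,du_1$$

$$= \int_0^\infty \frac{1}{70}\, e^{-u_2/70} \int_{u_2}^\infty \frac{1}{80}\, e^{-u_1/80}\,du_1\,du_2 = \frac{8}{15}.$$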
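
The two answers of Worked Example 2.1.8 can be compared with a direct simulation of the uniform model. This is a minimal sketch, not part of the original solution; the function name `simulate_iq`, the sample size and the seed are arbitrary assumptions.

```python
import math
import random

def simulate_iq(n=200_000, seed=11):
    rng = random.Random(seed)
    both_reach = mine_better = 0
    for _ in range(n):
        x1 = rng.random()                 # my fraction, uniform on (0, 1)
        x2 = 0.75 * rng.random()          # Mum's fraction, uniform on (0, 3/4)
        x = -80.0 * math.log(1.0 - x1)                  # my IQ index
        y = -70.0 * math.log((0.75 - x2) / 0.75)        # Mum's IQ index
        both_reach += (x >= 110.0 and y >= 110.0)
        mine_better += (x > y)
    return both_reach / n, mine_better / n

p_both, p_better = simulate_iq()
exact_both = math.exp(-3.0 * 110.0 / 112.0)   # about 0.05
exact_better = 8.0 / 15.0
```

The simulated frequencies should match $e^{-330/112}$ and $8/15$ up to Monte Carlo error.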


In the above examples, the CDF $F$ either had the form

$$F(y) = \int_{-\infty}^{y} f(x)\,dx, \quad y \in \mathbb{R},$$

or was locally constant, with positive jumps at the points of a discrete set
$\mathbb{X} \subset \mathbb{R}$. In the first case one says that the corresponding RV has an absolutely
continuous distribution (with a PDF $f$), and in the second one says
it has a discrete distribution concentrated on $\mathbb{X}$. It is not hard to check that
absolute continuity implies that $F$ is continuous (but not vice versa), and,
for the discrete distributions, the CDF is locally constant, i.e. manifestly
discontinuous. However, a combination of these two types is also possible.

Furthermore, there are CDFs that do not belong to any of these cases,
but we will not discuss them here in any detail (a basic example is Cantor’s
staircase function, which is continuous but grows from 0 to 1 on the set $K$
which has length zero). See Figure 2.16.

Figure 2.16

Returning to RVs, it has to be said that, for many purposes, the detailed
information about what exactly the outcome space $\Omega$ is where $X(\omega)$ is defined
is actually irrelevant. For example, normal RVs arise in a great variety
of models in statistics, but what matters is that they are jointly or individually
Gaussian, i.e. have a prescribed PDF. Also, an exponential RV arises
in many models and may be associated with a lifetime of an item or a time
between subsequent changes of a state in a system, or in a purely geometric
context. It is essential to be able to think of such RVs without referring to
a particular $\Omega$.

On the other hand, a standard way to represent a real RV $X$ with a
prescribed PDF $f(x)$, $x \in \mathbb{R}$, is as follows. You choose $\Omega$ to be the support
of the PDF $f$, i.e. the set $\{x \in \mathbb{R}:\ f(x) > 0\}$, define the probability
$P(A)$ as $\int_A f(x)\,dx$ (see equation (2.1.10)) and set $X(\omega) = \omega$ (or, if you like,
$X(x) = x$, $x \in \Omega$). In fact, then, trivially, the event $\{X < y\}$ will coincide
with the set $\{x \in \mathbb{R}:\ f(x) > 0,\ x < y\}$ and its probability will be

$$P(X < y) = \int f(x)\, I(x < y)\,dx = \int_{-\infty}^{y} f(x)\,dx.$$


In the final part of the solution to Worked Example 2.1.8 we did exactly
this: the change of variables $u_1 = -80\ln(1 - x_1)$, $u_2 = -70\ln[(3/4 - x_2)/(3/4)]$
with the inverse Jacobian

$$\frac{3}{4} \cdot \frac{1}{80}\, e^{-u_1/80} \cdot \frac{1}{70}\, e^{-u_2/70}$$

has put us on the half-lines $u_1 > 0$ and $u_2 > 0$, with the PDFs $e^{-u_1/80}/80$
and $e^{-u_2/70}/70$ and the factor $I(u_1 > u_2)$ indicating the event.

So, to visualise a uniform RV on interval $(a,b)$, we take the model with
$\Omega = (a,b)$ and define $f$ by equation (2.1.4); for an exponential or Gamma distribution,
$\Omega = \mathbb{R}_+$, the positive half-line, and $f$ is defined by equation (2.1.6)
or (2.1.7); and for a Gaussian or Cauchy distribution, $\Omega = \mathbb{R}$, the whole line,
and $f$ is defined by equation (2.1.5) or (2.1.8). In all cases, the standard
equation $X(x) = x$ defines an RV $X$ with the corresponding distribution.

Such a representation of an RV X with a given PDF/CDF will be partic- 
ularly helpful when we have to deal with a function Y = g(X). See below. 

So far we have encountered two types of RVs: either (i) with a discrete 
set of values (finite or countable), or (ii) with a PDF (on a subset of R 
or C). These types do not exhaust all occurring situations. In particular, a 
number of applications require consideration of an RV X that represents a 
‘mixture’ of the two above types where a positive portion of a probability 
mass is sitting at a point (or points) and another portion is spread out 
with a PDF over an interval in $\mathbb{R}$. Then the corresponding CDF $F_X$ has
jumps at the points $x_j$ where probability $P(X = x_j) > 0$, of a size equal
to $P(X = x_j)$, and is absolutely continuous outside these points. A
typical example is the CDF $F_W$ of the waiting time $W$ in a queue with
random arrival and service times (a popular setting is a hairdresser’s shop,
where each customer waits until the hairdresser finishes with the previous
customer). You may be lucky in entering the shop when the queue is empty:
then your waiting time is 0 (the probability of such event, however small,
is $> 0$). Otherwise you will wait for some positive time; under the simplest
assumptions about the arrival and service times:

$$F_W(y) = P(W < y) = \begin{cases} 0, & y \le 0, \\ 1 - \dfrac{\lambda}{\mu}\, e^{-(\mu - \lambda)y}, & y > 0. \end{cases} \tag{2.1.26}$$

Here $\lambda, \mu > 0$ are the rates of two exponential RVs: $\lambda$ is the rate of the interarrival
time, and $\mu$ that of the service time. Formula (2.1.26) makes sense
when $\lambda < \mu$, i.e. the service rate exceeds the arrival rate, which guarantees
that the queue does not overflow with time. The probability $P(W = 0)$ that
you wait zero time is then equal to $1 - \lambda/\mu > 0$ and the probability that
you wait a time $> y$ equals $(\lambda/\mu)\, e^{-(\mu - \lambda)y}$, $y > 0$.

In this example we can say that the distribution of RV $W$ has a discrete
component (concentrated at point 0) and an absolutely continuous component
(concentrated on $(0,\infty)$).
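
A mixed distribution of this kind is easy to sample: with probability $1 - \lambda/\mu$ return the atom 0, otherwise draw from the continuous component. The sketch below assumes formula (2.1.26) as stated; it does not simulate the queue itself, and the function names and parameter values are hypothetical.

```python
import math
import random

def waiting_time_cdf(y, lam, mu):
    """F_W(y) = P(W < y) from (2.1.26); requires lam < mu."""
    if y <= 0:
        return 0.0
    return 1.0 - (lam / mu) * math.exp(-(mu - lam) * y)

def sample_waiting_time(rng, lam, mu):
    """Atom at 0 with mass 1 - lam/mu; otherwise an Exp(mu - lam) draw."""
    if rng.random() < 1.0 - lam / mu:
        return 0.0
    return rng.expovariate(mu - lam)

lam, mu = 1.0, 2.0
rng = random.Random(7)
draws = [sample_waiting_time(rng, lam, mu) for _ in range(100_000)]
frac_zero = sum(w == 0.0 for w in draws) / len(draws)        # compare with 1 - lam/mu
frac_below_one = sum(w < 1.0 for w in draws) / len(draws)    # compare with F_W(1)
```

The continuous component is exponential with rate $\mu - \lambda$ because $P(W > y \mid W > 0) = e^{-(\mu-\lambda)y}$ by (2.1.26).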
Very often we want to find the PDF or CDF of a random variable $Y$ that
is a function $g(X)$ of another random variable $X$, with a given PDF (CDF).


Worked Example 2.1.9 The area of a circle is exponentially distributed
with parameter $\lambda$. Find the PDF of the radius of the circle.
Answer: $f_R(y) = 2\pi\lambda y\, e^{-\lambda\pi y^2}\, I(y > 0)$.


Worked Example 2.1.10 The radius of a circle is exponentially distributed
with parameter $\lambda$. Find the PDF of the area of the circle.


Answer: $f_{\text{area}}(s) = \lambda e^{-\lambda\sqrt{s/\pi}} \big/ \sqrt{4\pi s}$, $s > 0$.

In these two problems we have dealt with two mutually inverse maps given
by the square and the square root. For a general function $g$, the answer is the
result of a straightforward calculation involving the inverse Jacobian. More
precisely, if $Y = g(X)$, then the direct change of variables is $y = g(x)$, and

$$f_Y(y) = I\big(y \in \mathrm{Range}\,(g)\big) \sum_{x:\ g(x) = y} f_X(x)\, \frac{1}{|g'(x)|}. \tag{2.1.27}$$

Here, $\mathrm{Range}\,(g)$ stands for the set $\{y:\ y = g(x) \text{ for some } x \in \mathbb{R}\}$, and we
have assumed that the inverse image of $y$ is a discrete set, which allows us
to write the summation.

Equation (2.1.27) holds whenever the RHS is a correctly defined PDF
(which allows that $g'(x) = 0$ on a ‘thin’ set of points $y$).
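
Formula (2.1.27) can be applied mechanically once the roots of $g(x) = y$ and the derivative $g'$ are supplied. The sketch below does this for Worked Examples 2.1.9 and 2.1.10; the helper `pdf_of_function` and the value of `lam` are hypothetical and used only for illustration.

```python
import math

def pdf_of_function(f_x, preimages, g_prime, y):
    """Right-hand side of (2.1.27): sum of f_X(x) / |g'(x)| over roots x of g(x) = y."""
    return sum(f_x(x) / abs(g_prime(x)) for x in preimages(y))

lam = 1.3
exp_pdf = lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0

def radius_pdf(y):
    """Radius R = sqrt(S / pi) when the area S ~ Exp(lam) (Worked Example 2.1.9)."""
    if y <= 0:
        return 0.0
    return pdf_of_function(
        exp_pdf,
        preimages=lambda r: [math.pi * r ** 2],                  # the s with sqrt(s/pi) = r
        g_prime=lambda s: 1.0 / (2.0 * math.sqrt(math.pi * s)),  # d/ds sqrt(s/pi)
        y=y,
    )

def area_pdf(s):
    """Area S = pi R^2 when the radius R ~ Exp(lam) (Worked Example 2.1.10)."""
    if s <= 0:
        return 0.0
    return pdf_of_function(
        exp_pdf,
        preimages=lambda a: [math.sqrt(a / math.pi)],  # only the non-negative root carries mass
        g_prime=lambda r: 2.0 * math.pi * r,
        y=s,
    )
```

Evaluating `radius_pdf(0.7)` and `area_pdf(2.0)` reproduces the closed-form answers to the two worked examples.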


Example 2.1.11 If $b, c \in \mathbb{R}$ are constants, $c \ne 0$, then

$$f_{X+b}(y) = f_X(y - b), \quad \text{and} \quad f_{cX}(y) = \frac{1}{|c|}\, f_X(y/c).$$

Combining these two formulas, it is easy to see that the normal and Cauchy
distributions have the following scaling properties:

$$\text{if } X \sim \mathrm{N}(\mu, \sigma^2), \text{ then } \frac{1}{\sigma}(X - \mu) \sim \mathrm{N}(0, 1),$$

and

$$\text{if } X \sim \mathrm{Ca}(\alpha, \tau), \text{ then } \frac{1}{\tau}(X - \alpha) \sim \mathrm{Ca}(0, 1).$$

Also,

$$\text{if } X \sim \mathrm{N}(\mu, \sigma^2), \text{ then } cX + b \sim \mathrm{N}(c\mu + b,\ c^2\sigma^2),$$

and

$$\text{if } X \sim \mathrm{Ca}(\alpha, \tau), \text{ then } cX + b \sim \mathrm{Ca}(c\alpha + b,\ |c|\tau).$$

A formula emerging from Worked Example 2.1.9 is

$$f_{\sqrt{X}}(y) = 2y\, f_X(y^2)\, I(y > 0)$$

(assuming that RV $X$ takes non-negative values). Similarly, in Worked Example 2.1.10,

$$f_{X^2}(y) = \frac{1}{2\sqrt{y}}\big(f_X(\sqrt{y}) + f_X(-\sqrt{y})\big)\, I(y > 0)$$

(which is equal to $\dfrac{1}{2\sqrt{y}}\, f_X(\sqrt{y})\, I(y > 0)$ if $X$ is non-negative).
Assuming that $g$ is one-to-one, at least on the range of RV $X$, formula (2.1.27)
is simplified as the summation is reduced to a single inverse
image $x(y) = g^{-1}(y)$:

$$f_Y(y) = f_X\big(x(y)\big)\, \frac{1}{|g'(x(y))|}\, I\big(y \in \mathrm{Range}(g)\big) = f_X[x(y)]\, |x'(y)|\, I\big(y \in \mathrm{Range}(g)\big). \tag{2.1.28}$$


Example 2.1.12 Let $X \sim \mathrm{U}(0,1)$. Given $a, b, c, d > 0$, with $d < c$, consider RV

$$Y = \frac{a + bX}{c - dX}.$$

Using formula (2.1.28), one immediately obtains the PDF of $Y$:

$$f_Y(y) = \frac{bc + ad}{(b + dy)^2}, \quad y \in \Big(\frac{a}{c},\ \frac{a + b}{c - d}\Big).$$

Dealing with a pair $X$, $Y$ or a general collection of RVs $X_1, \ldots, X_n$, it is
convenient to use the concept of the joint PDF and joint CDF (just as we
used joint probabilities in the discrete case). Formally, we set, for real RVs
$X_1, \ldots, X_n$:

$$F_{\mathbf{X}}(y_1, \ldots, y_n) = P(X_1 < y_1, \ldots, X_n < y_n) \tag{2.1.29}$$

and we say that they have a joint PDF $f$ if, for all $y_1, \ldots, y_n \in \mathbb{R}$,

$$F_{\mathbf{X}}(y_1, \ldots, y_n) = \int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_n} f(\mathbf{x})\,d\mathbf{x}. \tag{2.1.30}$$

Here $\mathbf{X}$ is a random vector formed by RVs $X_1, \ldots, X_n$, and $\mathbf{x}$ is a point in
$\mathbb{R}^n$, with entries $x_1, \ldots, x_n$:

$$\mathbf{X} = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}, \quad \mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}.$$

Next, $d\mathbf{x} = dx_1 \times \cdots \times dx_n$, the Euclidean volume element. Then, given a
collection of intervals $[a_j, b_j)$, $j = 1, \ldots, n$,

$$P(a_1 < X_1 < b_1, \ldots, a_n < X_n < b_n) = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} f(\mathbf{x})\,d\mathbf{x}$$

and, in fact, $P(\mathbf{X} \in A) = \int_A f(\mathbf{x})\,d\mathbf{x}$ for all measurable subsets $A \subseteq \mathbb{R}^n$.

In Chapters 3 and 4 of this book (dealing with IB statistics) we will repeatedly
use the notation $f_{\mathbf{X}}$ for the joint PDF of RVs $X_1, \ldots, X_n$ constituting
vector $\mathbf{X}$; in the case of two random variables $X$, $Y$ we will write $f_{X,Y}(x, y)$
for the joint PDF and $F_{X,Y}(x, y)$ for the joint CDF, $x, y \in \mathbb{R}$. A convenient
formula for $f_{X,Y}$ is

$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x, y). \tag{2.1.31}$$

As before, joint PDF $f(x, y)$ has only to be non-negative and have the
integral $\int_{\mathbb{R}^2} f(x, y)\,dy\,dx = 1$; it may be unbounded and discontinuous. The
same is true about $f_{\mathbf{X}}(\mathbf{x})$. We will write $(X, Y) \sim f$ if $f = f_{X,Y}$.

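In Worked Example 2.1.8, the joint CDF of the two IQs is

$$F_{X,Y}(x, y) = \frac{1}{3/4}\, v\Big(\Big\{(x_1, x_2):\ 0 \le x_1 \le 1,\ 0 \le x_2 \le \tfrac{3}{4},\ -80\ln(1 - x_1) < x,\ -70\ln\frac{3/4 - x_2}{3/4} < y\Big\}\Big)\, I(x, y > 0)$$

$$= \big(1 - e^{-x/80}\big)\big(1 - e^{-y/70}\big)\, I(x, y > 0) = F_X(x)\, F_Y(y).$$

Naturally, the joint PDF is also the product:

$$f_{X,Y}(x, y) = \frac{1}{80}\, e^{-x/80}\, I(x > 0)\ \frac{1}{70}\, e^{-y/70}\, I(y > 0).$$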


If we know the joint PDF $f_{X,Y}$ of two RVs $X$ and $Y$, then their marginal
PDFs $f_X$ and $f_Y$ can be found by integration in the complementary variable:

$$f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x, y)\,dy, \quad f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x, y)\,dx. \tag{2.1.32}$$

This equation is intuitively clear: we apply a continuous analogue of the
formula of complete probability, by integrating over all values of $Y$ and
thereby ‘removing’ RV $Y$ from consideration. Formally, for all $y \in \mathbb{R}$,

$$F_X(y) = P(X < y) = P(X < y,\ -\infty < Y < \infty) = \int_{-\infty}^{y}\int_{-\infty}^{\infty} f_{X,Y}(x, x')\,dx'\,dx = \int_{-\infty}^{y} g_X(x)\,dx, \tag{2.1.33}$$

where

$$g_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, x')\,dx'.$$

On the other hand, again for all $y \in \mathbb{R}$,

$$F_X(y) = \int_{-\infty}^{y} f_X(x)\,dx. \tag{2.1.34}$$

Comparing the RHSs of equations (2.1.33) and (2.1.34), one deduces that
$f_X = g_X$, as required.


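Example 2.1.13 Consider a bivariate normal pair of RVs $X$, $Y$. Their
joint PDF is of the form (2.1.9) with $d = 2$. Here, positive-definite matrices
$\Sigma$ and $\Sigma^{-1}$ can be written as

$$\Sigma = \begin{pmatrix} \sigma_1^2 & r\sigma_1\sigma_2 \\ r\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \quad \Sigma^{-1} = \frac{1}{1 - r^2}\begin{pmatrix} 1/\sigma_1^2 & -r/(\sigma_1\sigma_2) \\ -r/(\sigma_1\sigma_2) & 1/\sigma_2^2 \end{pmatrix},$$

where $\sigma_1$, $\sigma_2$ are non-zero real numbers (with $\sigma_1^2, \sigma_2^2 > 0$) and $r$ is real, with
$|r| < 1$. Equation (2.1.9) then takes the form

$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \exp\Big\{\frac{-1}{2(1 - r^2)}\Big[\frac{(x - \mu_1)^2}{\sigma_1^2} - 2r\frac{(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2} + \frac{(y - \mu_2)^2}{\sigma_2^2}\Big]\Big\}. \tag{2.1.35}$$

We want to check that, marginally, $X \sim \mathrm{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathrm{N}(\mu_2, \sigma_2^2)$, i.e.
to calculate the marginal PDFs. For simplicity we omit, whenever possible,
limits of integration; integral $\int$ below means $\int_{-\infty}^{\infty}$, or $\int_{\mathbb{R}}$, an integral over
the whole line. Write

$$f_X(x) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \int \exp\Big\{-\frac{1}{2(1 - r^2)}\Big[\frac{(x - \mu_1)^2}{\sigma_1^2} - 2r\frac{(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2} + \frac{(y - \mu_2)^2}{\sigma_2^2}\Big]\Big\}\,dy$$

$$= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \int \exp\Big\{-\frac{1}{2(1 - r^2)}\Big[\frac{(1 - r^2)(x - \mu_1)^2}{\sigma_1^2} + \frac{(y_1 - \nu_1)^2}{\sigma_2^2}\Big]\Big\}\,dy_1$$

$$= \frac{e^{-(x - \mu_1)^2/2\sigma_1^2}}{\sqrt{2\pi}\,\sigma_1} \times \Big\{\frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1 - r^2}} \int \exp\Big[-\frac{(y_1 - \nu_1)^2}{2\sigma_2^2(1 - r^2)}\Big]\,dy_1\Big\}.$$

Here

$$y_1 = y - \mu_2, \quad \nu_1 = r\,\frac{\sigma_2}{\sigma_1}(x - \mu_1).$$

The last factor in the braces equals 1, and we obtain that

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-(x - \mu_1)^2/2\sigma_1^2},$$

i.e. $X \sim \mathrm{N}(\mu_1, \sigma_1^2)$. Similarly, $Y \sim \mathrm{N}(\mu_2, \sigma_2^2)$.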
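
The marginal computation can be confirmed numerically by integrating the joint PDF (2.1.35) over $y$ at a fixed $x$. The sketch below is illustrative: the midpoint rule, the truncation at 12 standard deviations and the parameter values are assumptions, and the function names are hypothetical.

```python
import math

def bivariate_normal_pdf(x, y, mu1, mu2, s1, s2, r):
    """Joint PDF (2.1.35) with standard deviations s1, s2 > 0 and |r| < 1."""
    zx, zy = (x - mu1) / s1, (y - mu2) / s2
    q = (zx * zx - 2.0 * r * zx * zy + zy * zy) / (1.0 - r * r)
    return math.exp(-0.5 * q) / (2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - r * r))

def normal_pdf(x, mu, s):
    return math.exp(-(x - mu) ** 2 / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s)

def marginal_x(x, mu1, mu2, s1, s2, r, half_width=12.0, n=20_000):
    """Midpoint-rule integral of the joint PDF over y in mu2 +/- half_width * s2."""
    lo = mu2 - half_width * s2
    h = 2.0 * half_width * s2 / n
    return h * sum(bivariate_normal_pdf(x, lo + (k + 0.5) * h, mu1, mu2, s1, s2, r)
                   for k in range(n))

params = dict(mu1=1.0, mu2=-2.0, s1=1.5, s2=0.7, r=0.6)
numeric = marginal_x(2.0, **params)                      # integral of (2.1.35) over y
exact = normal_pdf(2.0, params["mu1"], params["s1"])     # N(mu1, s1^2) density at x = 2
```

The two values agree to high accuracy, and the result does not depend on $r$, $\mu_2$ or $\sigma_2$, as the derivation predicts.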


The above ‘standard representation’ principle, where a real-valued RV
$X \sim f$ was identified as $X(x) = x$ for $x \in \mathbb{R}$ with the probability $P(A)$
given by formula (2.1.10), also works for joint distributions. The recipe is
similar: if you have a pair $(X, Y) \sim f$, then take $\omega = (x, y) \in \Omega = \mathbb{R}^2$, set

$$P(A) = \int_A f(x, y)\,dy\,dx \quad \text{for } A \subset \mathbb{R}^2$$

and define

$$X(\omega) = x, \quad Y(\omega) = y.$$

A similar recipe works in the case of general collections $X_1, \ldots, X_n$.
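A number of problems below are related to transformations of a pair
$(X, Y)$ to $(U, V)$, with

$$U = g_1(X, Y), \quad V = g_2(X, Y).$$

Then, to calculate the joint PDF $f_{U,V}$ from $f_{X,Y}$, one uses a formula similar
to (2.1.27). Namely, if the change of variables $(x, y) \mapsto (u, v)$, with $u = g_1(x, y)$,
$v = g_2(x, y)$, is one-to-one on the range of RVs $X$ and $Y$, then

$$f_{U,V}(u, v) = I\big((u, v) \in \mathrm{Range}\,(g_1, g_2)\big)\, f_{X,Y}\big(x(u, v), y(u, v)\big)\, \Big|\frac{\partial(x, y)}{\partial(u, v)}\Big|. \tag{2.1.36}$$

Here,

$$\frac{\partial(x, y)}{\partial(u, v)} = \det\begin{pmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{pmatrix}$$

is the Jacobian of the inverse change $(u, v) \mapsto (x, y)$:

$$\det\begin{pmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{pmatrix} = \Big[\det\begin{pmatrix} \partial u/\partial x & \partial u/\partial y \\ \partial v/\partial x & \partial v/\partial y \end{pmatrix}\Big]^{-1}.$$

The presence of the absolute value $\Big|\dfrac{\partial(x, y)}{\partial(u, v)}\Big|$ guarantees that $f_{U,V}(u, v) \ge 0$.

In particular, the PDFs of the sum $X + Y$ and the product $XY$ are
calculated as

$$f_{X+Y}(u) = \int f_{X,Y}(x, u - x)\,dx = \int f_{X,Y}(u - y, y)\,dy \tag{2.1.37}$$

and

$$f_{XY}(u) = \int f_{X,Y}(x, u/x)\,\frac{1}{|x|}\,dx = \int f_{X,Y}(u/y, y)\,\frac{1}{|y|}\,dy; \tag{2.1.38}$$

see equations (1.4.9) and (1.4.10). For the ratio $X/Y$:

$$f_{X/Y}(u) = \int |y|\, f_{X,Y}(yu, y)\,dy. \tag{2.1.39}$$

The derivation of formula (2.1.37) is as follows. If $U = X + Y$, then the
corresponding change of variables is $u = x + y$ and, say $v = y$, with the
inverse $x = u - v$, $y = v$. The Jacobian

$$\frac{\partial(x, y)}{\partial(u, v)} = \det\begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix} = 1,$$

hence

$$f_{X+Y,Y}(u, v) = f_{X,Y}(u - v, v).$$

Integrating in $dv$ yields

$$f_{X+Y}(u) = \int f_{X,Y}(u - v, v)\,dv,$$

which is the last integral on the RHS of (2.1.37). The middle integral is
obtained by using the change of variables $u = x + y$, $v = x$ (or simply by
observing that $X$ and $Y$ are interchangeable).

The derivation of formulas (2.1.38) and (2.1.39) is similar, with $1/|x|$,
$1/|y|$ and $|y|$ emerging as the absolute values of the corresponding (inverse)
Jacobians.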
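
Formula (2.1.37) can be tried out on the Simpson distribution of Remark 2.1.6. The sketch below assumes that $X$ and $Y$ are independent, so that $f_{X,Y}(x, y) = f_X(x) f_Y(y)$, and approximates the integral by the midpoint rule; the function names are hypothetical.

```python
def uniform01_pdf(x):
    return 1.0 if 0.0 < x < 1.0 else 0.0

def sum_pdf(u, f_x, f_y, lo=-1.0, hi=2.0, n=30_000):
    """Midpoint-rule version of (2.1.37) for independent X, Y:
    the integral of f_x(x) * f_y(u - x) over x in (lo, hi)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        total += f_x(x) * f_y(u - x)
    return h * total

def simpson_pdf(u):
    """Triangular PDF of X + Y for independent X, Y ~ U(0, 1)."""
    if 0.0 <= u <= 1.0:
        return u
    if 1.0 < u <= 2.0:
        return 2.0 - u
    return 0.0

# Triples (u, numerical convolution, triangular formula).
checks = [(u, sum_pdf(u, uniform01_pdf, uniform01_pdf), simpson_pdf(u))
          for u in (0.25, 1.0, 1.6)]
```

Each numerical value should match the triangular formula up to the discretisation error of the midpoint rule.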

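For completeness, we produce a general formula, assuming that a map
$u_1 = g_1(\mathbf{x}), \ldots, u_n = g_n(\mathbf{x})$ is one-to-one on the range of RVs $X_1, \ldots, X_n$.
Here, for the random vectors

$$\mathbf{X} = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} \quad \text{and} \quad \mathbf{U} = \begin{pmatrix} U_1 \\ \vdots \\ U_n \end{pmatrix},$$

with $U_i = g_i(\mathbf{X})$,

$$f_{\mathbf{U}}(u_1, \ldots, u_n) = \Big|\frac{\partial(x_1, \ldots, x_n)}{\partial(u_1, \ldots, u_n)}\Big|\, f_{\mathbf{X}}\big(x_1(\mathbf{u}), \ldots, x_n(\mathbf{u})\big) \times I\big(\mathbf{u} \in \mathrm{Range}(g_1, \ldots, g_n)\big). \tag{2.1.40}$$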


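Example 2.1.14 Given a bivariate normal pair $X$, $Y$, consider the PDF
of $X + Y$. From equations (2.1.35) and (2.1.37),

$$f_{X+Y}(u) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \int \exp\Big\{\frac{-1}{2(1 - r^2)}\Big[\frac{(x - \mu_1)^2}{\sigma_1^2} - 2r\frac{(x - \mu_1)(u - x - \mu_2)}{\sigma_1\sigma_2} + \frac{(u - x - \mu_2)^2}{\sigma_2^2}\Big]\Big\}\,dx$$

$$= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \int \exp\Big\{\frac{-1}{2(1 - r^2)}\Big[\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1(u_1 - x_1)}{\sigma_1\sigma_2} + \frac{(u_1 - x_1)^2}{\sigma_2^2}\Big]\Big\}\,dx_1,$$

where $x_1 = x - \mu_1$, $u_1 = u - \mu_1 - \mu_2$.
Extracting the complete square yields

$$\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1(u_1 - x_1)}{\sigma_1\sigma_2} + \frac{(u_1 - x_1)^2}{\sigma_2^2} = \Big(\frac{\sqrt{\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2}}{\sigma_1\sigma_2}\, x_1 - \frac{u_1}{\sigma_2}\,\frac{\sigma_1 + r\sigma_2}{\sqrt{\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2}}\Big)^2 + \frac{u_1^2(1 - r^2)}{\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2}.$$

We then obtain that

$$f_{X+Y}(u) = \frac{\exp\Big[-\dfrac{u_1^2}{2(\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2)}\Big]}{2\pi\sqrt{\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2}} \int e^{-v^2/2}\,dv = \frac{\exp\Big[-\dfrac{(u - \mu_1 - \mu_2)^2}{2(\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2)}\Big]}{\sqrt{2\pi(\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2)}}. \tag{2.1.41}$$

Here, the integration variable is

$$v = \frac{1}{\sqrt{1 - r^2}}\Big(\frac{\sqrt{\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2}}{\sigma_1\sigma_2}\, x_1 - \frac{u_1}{\sigma_2}\,\frac{\sigma_1 + r\sigma_2}{\sqrt{\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2}}\Big).$$

We see that

$$X + Y \sim \mathrm{N}\big(\mu_1 + \mu_2,\ \sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2\big),$$

with the mean value $\mu_1 + \mu_2$ and variance $\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2$.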


Another useful example is the following. 


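Example 2.1.15 Assume that RVs $X_1$ and $X_2$ are independent, and that
each has an exponential distribution with parameter $\lambda$. We want to find the
joint PDF of

$$Y_1 = X_1 + X_2 \quad \text{and} \quad Y_2 = X_1/X_2,$$

and check whether $Y_1$ and $Y_2$ are independent.
Consider the map

$$T:\ (x_1, x_2) \mapsto (y_1, y_2), \quad \text{where } y_1 = x_1 + x_2,\ y_2 = \frac{x_1}{x_2},$$

where $x_1, x_2, y_1, y_2 \ge 0$. The inverse map $T^{-1}$ acts by

$$T^{-1}:\ (y_1, y_2) \mapsto (x_1, x_2), \quad \text{where } x_1 = \frac{y_1 y_2}{1 + y_2},\ x_2 = \frac{y_1}{1 + y_2},$$

and has the Jacobian

$$J(y_1, y_2) = \det\begin{pmatrix} y_2/(1 + y_2) & y_1/(1 + y_2) - y_1 y_2/(1 + y_2)^2 \\ 1/(1 + y_2) & -y_1/(1 + y_2)^2 \end{pmatrix} = -\frac{y_1 y_2}{(1 + y_2)^3} - \frac{y_1}{(1 + y_2)^3} = -\frac{y_1}{(1 + y_2)^2}.$$

Then the joint PDF

$$f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}\Big(\frac{y_1 y_2}{1 + y_2},\ \frac{y_1}{1 + y_2}\Big)\, \Big|-\frac{y_1}{(1 + y_2)^2}\Big|.$$

Substituting $f_{X_1,X_2}(x_1, x_2) = \lambda e^{-\lambda x_1}\, \lambda e^{-\lambda x_2}$, $x_1, x_2 \ge 0$, yields

$$f_{Y_1,Y_2}(y_1, y_2) = \big(\lambda^2 y_1 e^{-\lambda y_1}\big)\, \frac{1}{(1 + y_2)^2}, \quad y_1, y_2 \ge 0.$$

The marginal PDFs are

$$f_{Y_1}(y_1) = \int_0^\infty f_{Y_1,Y_2}(y_1, y_2)\,dy_2 = \lambda^2 y_1 e^{-\lambda y_1}, \quad y_1 \ge 0,$$

and

$$f_{Y_2}(y_2) = \int_0^\infty f_{Y_1,Y_2}(y_1, y_2)\,dy_1 = \frac{1}{(1 + y_2)^2}, \quad y_2 \ge 0.$$

As $f_{Y_1,Y_2}(y_1, y_2) = f_{Y_1}(y_1)\, f_{Y_2}(y_2)$, RVs $Y_1$ and $Y_2$ are independent.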
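
The independence of $Y_1$ and $Y_2$ can be probed by simulation: for independent RVs the probability of a joint event factorises. The sketch below checks this for the events $\{Y_1 < 1\}$ and $\{Y_2 < 1\}$ only, so it is a consistency check rather than a proof; the function name, the value of `lam`, the sample size and the seed are arbitrary assumptions.

```python
import random

def simulate(lam=1.5, n=200_000, seed=3):
    rng = random.Random(seed)
    first = second = both = 0
    for _ in range(n):
        x1, x2 = rng.expovariate(lam), rng.expovariate(lam)
        a = (x1 + x2) < 1.0      # event {Y1 < 1}
        b = x1 < x2              # same event as {Y2 < 1}, written without division
        first += a
        second += b
        both += (a and b)
    return first / n, second / n, both / n

p_y1, p_y2, p_joint = simulate()
# For comparison: P(Y2 < 1) = integral of 1/(1 + y)^2 over (0, 1) = 1/2,
# and P(Y1 < 1) = 1 - (1 + lam) * exp(-lam) for the PDF lam^2 * y * exp(-lam * y).
```

The product `p_y1 * p_y2` should be close to `p_joint`, in line with the factorisation of the joint PDF.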
The definition of the conditional probability $P(A|B)$ does not differ from
the discrete case: $P(A|B) = P(A \cap B)/P(B)$. For example, if $X$ is an exponential
RV, then, for all $y, w > 0$,

$$P(X \ge y + w \mid X \ge w) = \frac{P(X \ge y + w)}{P(X \ge w)} = \frac{\int_{y+w}^{\infty} \lambda e^{-\lambda u}\,du}{\int_{w}^{\infty} \lambda e^{-\lambda u}\,du} = \frac{e^{-\lambda(y + w)}}{e^{-\lambda w}} = e^{-\lambda y} = P(X \ge y).$$

This is called the memoryless property of an exponential distribution, which
is similar to that of the geometric distribution (see equation (1.5.6)). It is not
surprising that the exponential distribution arises as a limit of geometrics as
$p = e^{-\lambda/n} \nearrow 1$ ($n \to \infty$). Namely, if $X \sim \mathrm{Exp}(\lambda)$ and $X^{(n)} \sim \mathrm{Geom}(e^{-\lambda/n})$
then, for all $y > 0$,

$$1 - e^{-\lambda y} = P(X < y) = \lim_{n \to \infty} P\big(X^{(n)} < ny\big). \tag{2.1.43}$$

Speaking of conditional probabilities (in a discrete or continuous setting),
it is instructive to think of a conditional probability distribution. Consequently,
in the continuous setting, we can speak of a conditional PDF.
See p. 166.

Of course, the formula of complete probability and Bayes’ Theorem still 
hold true (not only for finite but also for countable collections of pair-wise 
disjoint events B; with P(B;) > 0). 

Other remarkable facts are two Borel–Cantelli (BC) Lemmas, named after
E. Borel (1871-1956), the famous French mathematician (and for 15 years
the minister for the Navy), and F.P. Cantelli (1875-1966), an Italian
mathematician (the founder of the Italian Institute of Actuaries). The first BC
Lemma is that if $B_1, B_2, \ldots$ is a sequence of (not necessarily disjoint) events
with $\sum_j \mathbb{P}(B_j) < \infty$, then the probability $\mathbb{P}(A) = 0$, where $A$ is the
intersection $\bigcap_{n\ge1}\bigcup_{j\ge n} B_j$. The proof is straightforward if you are well versed in
basic manipulations with probabilities: if $A_n = \bigcup_{j\ge n} B_j$ then $A_{n+1} \subseteq A_n$
and $A = \bigcap_n A_n$. Then $A \subseteq A_n$ and hence $\mathbb{P}(A) \le \mathbb{P}(A_n)$ for all $n$. But
$\mathbb{P}(A_n) \le \sum_{j\ge n} \mathbb{P}(B_j)$, which tends to 0 as $n \to \infty$ because $\sum_j \mathbb{P}(B_j) < \infty$.
So, $\mathbb{P}(A) = 0$.

The first BC Lemma has a rather striking interpretation: if $\sum_j \mathbb{P}(B_j) < \infty$,
then with probability 1 only finitely many events $B_j$ can occur at the
same time. This is because the above intersection $A$ has the meaning that
‘infinitely many events $B_j$ occurred’. Formally, if outcome $\omega \in A$, then
$\omega \in B_j$ for infinitely many $j$.
The second BC Lemma says that if events $B_1, B_2, \ldots$ are independent and
$\sum_j \mathbb{P}(B_j) = \infty$, then $\mathbb{P}(A) = 1$. The proof is again straightforward:
$\mathbb{P}(A^c) = \mathbb{P}\big(\bigcup_{n\ge1} A_n^c\big) \le \sum_{n\ge1} \mathbb{P}(A_n^c)$. Next, one argues that

$$\mathbb{P}(A_n^c) = \mathbb{P}\Big(\bigcap_{j\ge n} B_j^c\Big) = \prod_{j\ge n} \mathbb{P}(B_j^c)
= \exp\Big\{\sum_{j\ge n} \ln\big[1 - \mathbb{P}(B_j)\big]\Big\} \le \exp\Big[-\sum_{j\ge n} \mathbb{P}(B_j)\Big],$$

as $\ln(1 - x) \le -x$ for $0 \le x < 1$. As $\sum_{j\ge n} \mathbb{P}(B_j) = \infty$, $\mathbb{P}(A_n^c) = 0$ for all $n$. Then
$\mathbb{P}(A^c) = 0$ and $\mathbb{P}(A) = 1$.

Thus, if $B_1, B_2, \ldots$ are independent events and $\sum_{j\ge1} \mathbb{P}(B_j) = \infty$, then ‘with
probability 1 there occur infinitely many of them’.


For example, in Worked Example 1.4.11 the events {year k is a record} 
are independent and have probabilities 1/k. By the second BC Lemma, with 
probability 1 there will be infinitely many record years if observations are 
continued indefinitely. 


Example 2.1.16 Here we discuss one intriguing application of the first
BC Lemma. Suppose you have two indistinguishable non-symmetric coins.
Coin 1 shows a head with probability $p_1$ and a tail with probability $q_1$.
Coin 2 shows a head with probability $p_2$ and a tail with probability $q_2$. The
values $p_1, q_1, p_2, q_2$ are known but their attribution to coins is kept in the
dark. We always assume that a head brings the profit 1 and the tail brings no
profit. The problem is to develop an algorithm of coin selection for successive
tosses such that the empirical frequency of the heads approaches $\max[p_1, p_2]$.
There is an additional constraint: the selection of a coin for the $n$th toss
may be based on the following information: (a) the number of the trial $n$,
and (b) the selection of the coins during the trials $n - 1$ and $n - 2$ and
the results of these trials. This additional constraint appears as a restriction
on the memory. Alternatively, we are interested in finding the minimal
information about the past that is sufficient to recognise the best coin.

Here we present an algorithm proposed in [32]. Let us split all the
trials $1, 2, \ldots$ into successive periods of teaching and accumulation.
We denote these periods by $T_1, A_1, T_2, A_2, \ldots$, and their lengths by
$|T_1|, |A_1|, |T_2|, |A_2|, \ldots$. The first period (including the first two tosses) is
teaching. Let a teaching period always start with coin 1 and the length of
the teaching period always be even. This algorithm guarantees that

$$\#\{\text{heads after } n \text{ tosses}\}/n \to \max[p_1, p_2].$$

Select the length of the $k$th teaching period $|T_k| = 2c(k)m(k)$; we say that it
consists of $m(k)$ pieces of length $2c(k)$. We start the first piece with a toss
of coin 1; this coin has a priority. We say that the coin used in the first toss
on a piece $j = 2, \ldots, m(k)$ has a priority on this piece. The priority coin is
selected by a special rule based on the results of trials on the previous piece,
$j - 1$.

Suppose that the coin $l = 1, 2$ has a priority on the piece $j$. If this coin
shows a head at the first trial, it preserves its priority and the next piece
$j + 1$ is again opened with the same coin. If the coin $l$ shows a tail then the
alternative coin $l^*$ is used on the next toss. If the coin $l^*$ shows a tail we
return to the coin $l$ and use it for all the remaining $2c(k) - 2$ tosses. In this
case the coin $l$ keeps its priority in the series $j + 1$. If the coin $l^*$ shows a
head we use the coin $l$ on the next trial and use the same rule (if $l$ shows
a head it keeps its priority; in the other case the coin $l^*$ would be given a
chance to challenge the priority of the coin $l$). The change of priority from
$l$ to $l^*$ occurs if and only if the coin $l$ fails all $c(k)$ trials and the alternative
$l^*$ showed heads in all successive $c(k)$ trials. In this case the coin $l^*$ obtains
priority and will open the next series, $j + 1$.

In such a manner we reach the end of a teaching period and only the
priority coin will be used during the next accumulation period. We give
a few examples: in the first case the priority of coin 1 is preserved, in the
second case the priority of coin 2 is preserved, in the third case the priority
of coin 2 is shifted to coin 1. The symbol X could take values H or T.

[Three toss diagrams, each listing the coins used on a piece above the outcomes obtained. Only the third is legible in the source: coins 2, 1, 2, 1, ..., 2, 1 with outcomes T, H, T, H, ..., T, H.]


Now we specify the parameters $c(k)$, $m(k)$ and $|A_k|$. Let us select $c(k) = k$
(any linear function of $k$ would fit). Next, select $|A_k|$ from the condition

$$\left(2\sum_{j=1}^{k} c(j)m(j) + \sum_{j=1}^{k-1} |A_j|\right)\Big/ |A_k| \to 0$$

as $k \to \infty$. The selection of $m(k)$ is a bit more delicate. The random process
of changing priorities is described by a Markov chain with two states (see
vol. II of this series for details). The transition probability $1 \to 2$ equals
$(q_1p_2)^{c(k)}$ and the transition probability $2 \to 1$ equals $(q_2p_1)^{c(k)}$. For $m(k)$
large enough the probabilities to have coin 1 or 2 as the priority one are close
to $\pi_1 = \dfrac{1}{1 + a^{c(k)}}$ and $\pi_2 = \dfrac{a^{c(k)}}{1 + a^{c(k)}}$, respectively (and the errors decrease
geometrically with the growth of $m(k)$). Here $a = \dfrac{q_1p_2}{q_2p_1}$. The probabilities
$(\pi_1, \pi_2)$ are the invariant distribution of the Markov chain with two states.

Now we use the first BC Lemma. Without loss of generality we assume
that $a < 1$, i.e. $p_1 > p_2$. In this case the probability that coin 1 is not the
priority one after the $k$th teaching period does not exceed $2a^{c(k)}$.
The series $\sum_{k\ge1} a^{c(k)} < \infty$ for $c(k) = k$. By the first BC Lemma, we
identify the correct coin 1 after a finite number of teaching periods.


In the continuous case we can also work with conditional probabilities of
the type $\mathbb{P}(A|X = x)$, under the condition that an RV $X$ with PDF $f_X$
takes value $x$. We know that such a condition has probability 0, and yet all
calculations, when performed correctly, yield right answers. The trick
is that we consider not a single value $x$ but all of them, within the range
of RV $X$. Formally speaking, we work with a continuous analogue of the
formula of complete probability:

$$\mathbb{P}(A) = \int \mathbb{P}(A|X = x) f_X(x)\,dx. \qquad (2.1.44)$$

This formula holds provided that $\mathbb{P}(A|X = x)$ is defined in a ‘sensible’ way,
which is usually possible in all naturally occurring situations. For example,
if $f_{X,Y}(x, y)$ is the joint PDF of RVs $X$ and $Y$ and $A$ is the event $\{a < Y < b\}$,
then

$$\mathbb{P}(A|X = x) = \int_a^b f_{X,Y}(x, y)\,dy \Big/ \int f_{X,Y}(x, y')\,dy'.$$

The next step is then to introduce the conditional PDF $f_{Y|X}(y|x)$ of RV
$Y$, conditional on $\{X = x\}$:

$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}, \qquad (2.1.45)$$

and write natural analogues of the formula of complete probability and the
Bayes formula:

$$f_Y(y) = \int f_{Y|X}(y|x) f_X(x)\,dx,\qquad f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{\int f_{X|Y}(x|z) f_Y(z)\,dz}. \qquad (2.1.46)$$
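As in the discrete case, two events $A$ and $B$ are called independent if

$$\mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B); \qquad (2.1.47)$$

for a general finite collection $A_1, \ldots, A_n$, $n > 2$, we require that for all
$k = 2, \ldots, n$ and $1 \le i_1 < \cdots < i_k \le n$,

$$\mathbb{P}\Big(\bigcap_{1\le j\le k} A_{i_j}\Big) = \prod_{1\le j\le k} \mathbb{P}(A_{i_j}). \qquad (2.1.48)$$

Similarly, the RVs $X$ and $Y$, on the same outcome space $\Omega$, are called
independent if

$$\mathbb{P}(X < x, Y < y) = \mathbb{P}(X < x)\mathbb{P}(Y < y)\quad\text{for all } x, y \in \mathbb{R}. \qquad (2.1.49)$$

In other words, the joint CDF $F_{X,Y}(x, y)$ decomposes into the product
$F_X(x)F_Y(y)$. Finally, a collection of RVs $X_1, \ldots, X_n$ (again on the same
space $\Omega$), $n > 2$, is called independent if, for all $k = 2, \ldots, n$ and
$1 \le i_1 < \cdots < i_k \le n$,

$$\mathbb{P}(X_{i_1} < y_1, \ldots, X_{i_k} < y_k) = \prod_{1\le j\le k} \mathbb{P}(X_{i_j} < y_j),\quad y_1, \ldots, y_k \in \mathbb{R}. \qquad (2.1.50)$$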


Remark 2.1.17 We can require equation (2.1.50) for the whole collection
$X_1, \ldots, X_n$ only, but allowing $y_i$ to take value $+\infty$:

$$\mathbb{P}(X_1 < y_1, \ldots, X_n < y_n) = \prod_{1\le j\le n} \mathbb{P}(X_j < y_j),\quad y_1, \ldots, y_n \in \overline{\mathbb{R}} = \mathbb{R} \cup \{\infty\}. \qquad (2.1.51)$$

Indeed, if some values $y_i$ in equation (2.1.51) are equal to $\infty$, it means that
the corresponding condition $X_i < \infty$ is trivially fulfilled and can be omitted.
If the RVs under consideration have a joint PDF $f_{X,Y}$ or $f_{\mathbf{X}}$, then
equations (2.1.50) and (2.1.51) are equivalent to its product decomposition:

$$f_{X,Y}(x, y) = f_X(x) f_Y(y),\qquad f_{\mathbf{X}}(\mathbf{x}) = f_{X_1}(x_1)\cdots f_{X_n}(x_n). \qquad (2.1.52)$$

The formal proof of this fact is as follows. The decomposition $F_{X,Y}(y_1, y_2) =
F_X(y_1)F_Y(y_2)$ means that for all $y_1, y_2 \in \mathbb{R}$:

$$\int_{-\infty}^{y_1}\int_{-\infty}^{y_2} f_{X,Y}(x, y)\,dy\,dx = \int_{-\infty}^{y_1} f_X(x)\,dx \int_{-\infty}^{y_2} f_Y(y)\,dy.$$

One deduces then that the integrands must coincide, i.e.

$$f_{X,Y}(x, y) = f_X(x) f_Y(y).$$


The inverse implication is straightforward. The argument for a general
collection $X_1, \ldots, X_n$ is analogous.
For independent RVs $X$, $Y$, equations (2.1.37)–(2.1.39) become

$$f_{X+Y}(u) = \int f_X(x) f_Y(u - x)\,dx = \int f_X(u - y) f_Y(y)\,dy, \qquad (2.1.53)$$

$$f_{XY}(u) = \int f_X(x) f_Y(u/x)\frac{1}{|x|}\,dx = \int f_X(u/y) f_Y(y)\frac{1}{|y|}\,dy, \qquad (2.1.54)$$

and

$$f_{X/Y}(u) = \int |y|\, f_X(yu) f_Y(y)\,dy. \qquad (2.1.55)$$

See also Worked Example 2.1.20. As in the discrete case, equation (2.1.53)
is called the convolution formula (for densities).
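
As an illustration of the convolution formula (2.1.53), the following sketch evaluates the integral by a midpoint rule and compares it with two closed forms: for two independent $\mathrm{Exp}(\lambda)$ RVs the density of the sum is $\lambda^2 u e^{-\lambda u}$ (the marginal of $Y_1$ in Example 2.1.15), and for rates $\lambda \ne \mu$ it is $\lambda\mu(e^{-\lambda u} - e^{-\mu u})/(\mu - \lambda)$. The helper names and the choice of quadrature are assumptions made for this example only.

```python
import math

def exp_pdf(rate):
    """Return the PDF of Exp(rate) as a function of x."""
    def pdf(x):
        return rate * math.exp(-rate * x) if x > 0 else 0.0
    return pdf

def convolve_at(f_x, f_y, u, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f_x(x) * f_y(u - x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += f_x(x) * f_y(u - x)
    return total * h

if __name__ == "__main__":
    u = 1.3
    # Equal rates: compare with lam^2 * u * exp(-lam * u)
    print(convolve_at(exp_pdf(2.0), exp_pdf(2.0), u, 0.0, u), 4.0 * u * math.exp(-2.0 * u))
    # Different rates: compare with lam*mu/(mu-lam) * (exp(-lam*u) - exp(-mu*u))
    print(convolve_at(exp_pdf(1.0), exp_pdf(3.0), u, 0.0, u),
          1.5 * (math.exp(-u) - math.exp(-3.0 * u)))
```

Since both densities vanish on the negative half-line, the integration range reduces to $(0, u)$.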

The concept of independent, identically distributed (IID) RVs employed 
in the previous chapter will continue to play a prominent role, particularly 
in Section 2.3. 


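Worked Example 2.1.18 A shot is fired at a circular target. The vertical
and horizontal co-ordinates of the point of impact (taking the centre of the
target as origin) are independent RVs, each distributed N(0, 1).

Show that the distance of the point of impact from the centre has the
PDF $re^{-r^2/2}$ for $r > 0$. Find the median of this distribution.

Solution In fact, $X \sim \mathrm{N}(0, 1)$, $Y \sim \mathrm{N}(0, 1)$ and

$$f_{X,Y}(x, y) = \frac{1}{2\pi} e^{-(x^2+y^2)/2},\quad x, y \in \mathbb{R}.$$

Set $R = \sqrt{X^2 + Y^2}$ and $\Theta = \tan^{-1}(Y/X)$. The range of the map

$$(x, y) \mapsto (r, \theta)$$

is $\{r > 0,\ \theta \in (-\pi, \pi)\}$ and the Jacobian

$$\frac{\partial(r, \theta)}{\partial(x, y)} = \det\begin{pmatrix} \partial r/\partial x & \partial r/\partial y \\ \partial\theta/\partial x & \partial\theta/\partial y \end{pmatrix}$$

equals

$$\det\begin{pmatrix} \dfrac{x}{r} & \dfrac{y}{r} \\[2mm] \dfrac{-y/x^2}{1 + (y/x)^2} & \dfrac{1/x}{1 + (y/x)^2} \end{pmatrix} = \frac{1}{r},\quad x, y \in \mathbb{R},\ r > 0,\ \theta \in (-\pi, \pi).$$

Then the inverse map

$$(r, \theta) \mapsto (x, y)$$

has the Jacobian $\partial(x, y)/\partial(r, \theta) = r$. Hence,

$$f_{R,\Theta}(r, \theta) = re^{-r^2/2} I(r > 0)\,\frac{1}{2\pi} I(\theta \in (-\pi, \pi)).$$

Integrating in $d\theta$ yields

$$f_R(r) = re^{-r^2/2} I(r > 0),$$

as required. To find the median $m(R)$, consider the equation

$$\int_0^y re^{-r^2/2}\,dr = \int_y^\infty re^{-r^2/2}\,dr,$$

giving $e^{-y^2/2} = 1/2$. So, $m(R) = \sqrt{\ln 4}$.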
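
The value $m(R) = \sqrt{\ln 4} \approx 1.177$ can be confirmed by simulation. The sketch below is illustrative only (the function name, sample size and seed are arbitrary): it draws independent N(0, 1) co-ordinates and takes the sample median of the distances from the origin.

```python
import math
import random
import statistics

def simulated_median_distance(n=100_000, seed=1):
    """Sample median of sqrt(X^2 + Y^2) for independent X, Y ~ N(0, 1)."""
    rng = random.Random(seed)
    distances = [math.hypot(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    return statistics.median(distances)

if __name__ == "__main__":
    print(simulated_median_distance(), math.sqrt(math.log(4)))
```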


We conclude this section’s theoretical considerations with the following
remark. Let $X$ be an RV with CDF $F$ and let $0 < y < 1$ be a value taken
by $F$, with $F(x^*) = y$ at the left point $x^* = \inf[x \in \mathbb{R}: F(x) \ge y]$. Then

$$\mathbb{P}(F(X) < y) = y. \qquad (2.1.56)$$

In fact, in this case the event $\{F(X) < y\} = \{X < x^*\}$, implying $F(x^*) =
\mathbb{P}(F(X) < y)$. In general, if $g: \mathbb{R} \to \mathbb{R}$ is another function and there is
a unique point $a \in \mathbb{R}$ such that $F(a) = g(a)$, $F(x) < g(x)$ for $x < a$ and
$F(x) > g(x)$ for $x > a$, then

$$\mathbb{P}(F(X) < g(X)) = g(a). \qquad (2.1.57)$$


Worked Example 2.1.19 A random sample $X_1, \ldots, X_{2n+1}$ is taken from
a distribution with PDF $f$. Let $Y_1, \ldots, Y_{2n+1}$ be values of $X_1, \ldots, X_{2n+1}$
arranged in increasing order. Find the distribution of each of $Y_k$,
$k = 1, \ldots, 2n + 1$.

Solution If $X_i \sim f$, with CDF $F$, then

$$F_{Y_k}(y) = \mathbb{P}(Y_k < y) = \sum_{j=k}^{2n+1}\binom{2n+1}{j} F(y)^j [1 - F(y)]^{2n+1-j},\quad y \in \mathbb{R}.$$

The PDF $f_{Y_k}$, $k = 1, \ldots, 2n + 1$, can be obtained as follows. $Y_k$ takes value
$x$ if and only if $k - 1$ values among $X_1, \ldots, X_{2n+1}$ are less than $x$, $2n + 1 - k$
are greater than $x$, and the remaining one is equal to $x$. Hence

$$f_{Y_k}(x) = \frac{(2n+1)!}{(k-1)!\,(2n+1-k)!}\,[F(x)]^{k-1}[1 - F(x)]^{2n+1-k} f(x).$$

In particular, if $X_i \sim \mathrm{U}[0, 1]$, the PDF of the sample median $Y_{n+1}$ is

$$f_{Y_{n+1}}(x) = \frac{(2n+1)!}{n!\,n!}\,[x(1 - x)]^n.$$
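
The two descriptions of the law of $Y_k$ (the binomial sum for the CDF and the explicit PDF) can be cross-checked numerically in the uniform case, where $F(y) = y$ on $[0, 1]$. The sketch below is an illustration under that assumption; the function names and the midpoint-rule integration are choices made for this example, with `total` standing for the sample size $2n + 1$.

```python
import math

def order_stat_cdf(k, total, F):
    """P(Y_k < y) via the binomial sum, where F = F(y) and total = 2n + 1."""
    return sum(math.comb(total, j) * F**j * (1 - F)**(total - j)
               for j in range(k, total + 1))

def order_stat_pdf_uniform(k, total, x):
    """PDF of the k-th order statistic of `total` IID U[0, 1] RVs at x in [0, 1]."""
    coef = math.factorial(total) / (math.factorial(k - 1) * math.factorial(total - k))
    return coef * x**(k - 1) * (1 - x)**(total - k)

def integrate_pdf(k, total, y, steps=20_000):
    """Midpoint-rule integral of the PDF over [0, y]."""
    h = y / steps
    return h * sum(order_stat_pdf_uniform(k, total, (i + 0.5) * h) for i in range(steps))

if __name__ == "__main__":
    # Sample median of 7 uniform points (n = 3, k = n + 1 = 4)
    print(order_stat_cdf(4, 7, 0.4), integrate_pdf(4, 7, 0.4))
```

For $n = 3$ the median density at $x = 1/2$ equals $140/64 = 2.1875$, in agreement with the last displayed formula.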


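Worked Example 2.1.20 Let $X$ and $Y$ be independent RVs with
respective PDFs $f_X$ and $f_Y$. Show that $Z = Y/X$ has the PDF

$$f_Z(u) = \int f_X(x) f_Y(ux)|x|\,dx.$$

Deduce that $T = \tan^{-1}(Y/X)$ is uniformly distributed on $(-\pi/2, \pi/2)$ if
and only if

$$\int f_X(x) f_Y(xz)|x|\,dx = \frac{1}{\pi(1 + z^2)}$$

for $z \in \mathbb{R}$. Verify that this holds if $X$ and $Y$ both have the normal distribution
with mean 0 and non-zero variance $\sigma^2$.

Solution The distribution function of RV $Z$ is

$$F_Z(u) = \mathbb{P}(Z < u) = \mathbb{P}(Y/X < u, X > 0) + \mathbb{P}(Y/X < u, X < 0)
= \int_0^\infty f_X(x)\int_{-\infty}^{ux} f_Y(y)\,dy\,dx + \int_{-\infty}^0 f_X(x)\int_{ux}^{\infty} f_Y(y)\,dy\,dx.$$

Then the PDF

$$f_Z(u) = \frac{d}{du}F_Z(u) = \int_0^\infty f_X(x) f_Y(ux)\,x\,dx - \int_{-\infty}^0 f_X(x) f_Y(ux)\,x\,dx
= \int f_X(x) f_Y(ux)|x|\,dx,$$

which agrees with the formula obtained via the Jacobian.
If $T$ is uniformly distributed, then

$$F_Z(u) = \mathbb{P}(\tan^{-1} Z \le \tan^{-1} u) = \frac{\tan^{-1} u + \pi/2}{\pi},\qquad
f_Z(u) = \frac{d}{du}F_Z(u) = \frac{1}{\pi(u^2 + 1)}.$$

Conversely,

$$f_Z(u) = \frac{1}{\pi(1 + u^2)},\quad\text{implying}\quad F_Z(u) = \frac{1}{\pi}\tan^{-1} u + \frac{1}{2}.$$

We deduce that

$$\mathbb{P}(T \le u) = \frac{1}{\pi}u + \frac{1}{2},$$

or $f_T(u) = 1/\pi$ on $(-\pi/2, \pi/2)$.
Finally, take

$$f_X(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-x^2/2\sigma^2},\qquad f_Y(y) = \frac{1}{\sqrt{2\pi}\sigma}e^{-y^2/2\sigma^2}.$$

Then

$$\int f_X(x) f_Y(ux)|x|\,dx = \frac{1}{\pi\sigma^2}\int_0^\infty e^{-x^2(1+u^2)/2\sigma^2}\,x\,dx
= \frac{1}{\pi\sigma^2}\left(\frac{-\sigma^2}{u^2 + 1}\right)e^{-x^2(1+u^2)/2\sigma^2}\Big|_0^\infty
= \frac{1}{\pi(1 + u^2)}.$$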


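Worked Example 2.1.21 Let $X_1, X_2, \ldots$ be independent Cauchy RVs,
each with PDF

$$f(x) = \frac{d}{\pi(d^2 + x^2)}.$$

Show that $A_n = (X_1 + X_2 + \cdots + X_n)/n$ has the same distribution as $X_1$.

Solution For $d = 1$ and $n = 2$, $f(x) = 1/[\pi(1 + x^2)]$. Then $A_2 = (X_1 + X_2)/2$
has the CDF $F_{A_2}(x) = \mathbb{P}(X_1 + X_2 < 2x)$, with the PDF

$$\widetilde{g}(x) = 2g(2x),\quad\text{where } g(y) = \frac{1}{\pi^2}\int \frac{1}{1 + u^2}\,\frac{du}{1 + (y - u)^2}.$$

Now we use the identity

$$m\int \frac{dy}{(1 + y^2)(m^2 + (x - y)^2)} = \frac{\pi(m + 1)}{(m + 1)^2 + x^2},$$

which is a simple but tedious exercise (it becomes straightforward if you use
complex integration). With $m = 1$ this yields

$$g(y) = \frac{1}{\pi}\,\frac{2}{4 + y^2},$$

which implies

$$\widetilde{g}(x) = 2g(2x) = \frac{1}{\pi}\,\frac{1}{1 + x^2} = f(x).$$

In general, if $Y = qX_1 + (1 - q)X_2$, $0 < q < 1$, then its PDF is

$$\frac{d}{dx}\int f(u)\int_{-\infty}^{(x - qu)/(1-q)} f(y)\,dy\,du = \frac{1}{1 - q}\int f\left(\frac{x - qu}{1 - q}\right) f(u)\,du,$$

which, by the same identity (with $m = (1 - q)/q$ and $x/q$ in place of $x$), is

$$\frac{1}{\pi}\,\frac{1}{q}\,\frac{m + 1}{(m + 1)^2 + x^2/q^2} = \frac{1}{\pi}\,\frac{1}{1 + x^2}.$$

Now, for $d = 1$ and a general $n$ we can write $A_n = [(n - 1)A_{n-1} + X_n]/n$.
Make the induction hypothesis $A_{n-1} \sim f(x)$. Then by the above argument
(with two summands, $A_{n-1}$ and $X_n$, and $q = (n-1)/n$), $A_n \sim f(x)$.
Finally, for a general $d$, we set $\widetilde{X}_i = X_i/d$. Then the corresponding average $\widetilde{A}_n \sim \widetilde{f}(x) = 1/[\pi(1 + x^2)]$.
Hence, for the original variables, $A_n \sim f(x) = d/[\pi(d^2 + x^2)]$.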


Remark 2.1.22 A shorter solution is based on characteristic functions; 
see equation (2.2.34). For the proof, see Problem 2.3.25. 
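
The claim of Worked Example 2.1.21 can also be probed by simulation. For a Cauchy RV with parameter $d$ we have $\mathbb{P}(|X_1| < d) = 1/2$, so the same must hold for the average $A_n$. The sketch below is an illustration only: sampling by the inverse-CDF formula $X = d\tan(\pi(U - 1/2))$ with $U \sim \mathrm{U}(0, 1)$, as well as the function names, trial count and seed, are choices made for this example.

```python
import math
import random

def cauchy_sample(rng, d=1.0):
    """Draw one Cauchy RV with PDF d / (pi * (d^2 + x^2)) by inverting the CDF."""
    return d * math.tan(math.pi * (rng.random() - 0.5))

def fraction_within(n_terms, d=1.0, trials=20_000, seed=2):
    """Empirical P(|A_n| < d), where A_n is the average of n_terms Cauchy RVs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        avg = sum(cauchy_sample(rng, d) for _ in range(n_terms)) / n_terms
        if abs(avg) < d:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for n_terms in (1, 2, 10):
        print(n_terms, fraction_within(n_terms))
```

In contrast with RVs having a finite variance, averaging does not concentrate the distribution: the printed fractions stay near 1/2 for every $n$.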


Worked Example 2.1.23 If $X$, $Y$ and $Z$ are independent RVs, each
uniformly distributed on (0, 1), show that $(XY)^Z$ is also uniformly distributed
on (0, 1).

Solution Take

$$\ln[(XY)^Z] = Z(\ln X + \ln Y).$$

To prove that $(XY)^Z$ is uniformly distributed on (0, 1) is the same as to
prove that $-Z(\ln X + \ln Y)$ is exponentially distributed on $[0, \infty)$. Now
$W = -\ln X - \ln Y$ has the PDF

$$f_W(y) = \begin{cases} ye^{-y}, & 0 < y < \infty, \\ 0, & y \le 0. \end{cases}$$

The joint PDF $f_{Z,W}(x, y)$ is of the form

$$f_{Z,W}(x, y) = \begin{cases} ye^{-y}, & \text{if } 0 < x < 1,\ 0 < y < \infty, \\ 0, & \text{otherwise.} \end{cases}$$

We are interested in the product $ZW$. It is convenient to pass from $x, y$
to variables $u = xy$, $v = y/x$ with the inverse Jacobian $1/(2v)$. In the new
variables the joint PDF $f_{U,V}$ of RVs $U = ZW$ and $V = W/Z$ reads

$$f_{U,V}(u, v) = \begin{cases} \dfrac{1}{2v}(uv)^{1/2}e^{-(uv)^{1/2}}, & u, v > 0,\ 0 < (u/v)^{1/2} < 1, \\ 0, & \text{otherwise.} \end{cases}$$

The PDF of $U$ is then

$$f_U(u) = \int f_{U,V}(u, v)\,dv = \int_u^\infty \frac{1}{2v}(uv)^{1/2}e^{-(uv)^{1/2}}\,dv
= -\int_u^\infty d\left(e^{-(uv)^{1/2}}\right) = e^{-u},\quad u > 0,$$

with $f_U(u) = 0$ for $u < 0$.


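Worked Example 2.1.24 Let RVs $X_1, \ldots, X_n$ be independent and
exponentially distributed, with the same parameter $\lambda$. By using induction on
$n$ or otherwise, show that the RVs

$$\max[X_1, \ldots, X_n]\quad\text{and}\quad X_1 + \frac{1}{2}X_2 + \cdots + \frac{1}{n}X_n$$

have the same distribution.

Solution Write

$$Y_n = \max[X_1, \ldots, X_n],\qquad Z_n = X_1 + \frac{1}{2}X_2 + \cdots + \frac{1}{n}X_n.$$

Now write

$$\mathbb{P}(Y_n < y) = \big(\mathbb{P}(X_1 < y)\big)^n = (1 - e^{-\lambda y})^n.$$

For $n = 1$, $Y_1 = Z_1$. Now use induction in $n$:

$$\mathbb{P}(Z_n < y) = \mathbb{P}\left(Z_{n-1} + \frac{1}{n}X_n < y\right).$$

As $X_n/n \sim \mathrm{Exp}(n\lambda)$, the last probability equals

$$\int_0^y (1 - e^{-\lambda z})^{n-1}\,n\lambda e^{-n\lambda(y - z)}\,dz
= n\lambda e^{-n\lambda y}\int_0^y (1 - e^{-\lambda z})^{n-1}e^{n\lambda z}\,dz
= ne^{-n\lambda y}\int_1^{e^{\lambda y}} (u - 1)^{n-1}\,du$$

$$= ne^{-n\lambda y}\int_0^{e^{\lambda y} - 1} v^{n-1}\,dv = e^{-n\lambda y}(e^{\lambda y} - 1)^n = (1 - e^{-\lambda y})^n.$$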
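
A quick simulation supports the identity in law of Worked Example 2.1.24. The sketch below is an illustration only (the function name, the point $y$, the sample size and the seed are arbitrary): it estimates $\mathbb{P}(Y_n < y)$ and $\mathbb{P}(Z_n < y)$ from the same simulated samples and returns them together with the exact value $(1 - e^{-\lambda y})^n$.

```python
import math
import random

def compare_at(y, n=5, lam=1.0, trials=50_000, seed=3):
    """Return empirical P(max < y), empirical P(weighted sum < y) and the exact CDF."""
    rng = random.Random(seed)
    max_hits = 0
    sum_hits = 0
    for _ in range(trials):
        xs = [rng.expovariate(lam) for _ in range(n)]
        if max(xs) < y:
            max_hits += 1
        # Z_n = X_1 + X_2/2 + ... + X_n/n
        if sum(x / (k + 1) for k, x in enumerate(xs)) < y:
            sum_hits += 1
    exact = (1.0 - math.exp(-lam * y)) ** n
    return max_hits / trials, sum_hits / trials, exact

if __name__ == "__main__":
    print(compare_at(2.0))
```

Note that $Y_n$ and $Z_n$ are equal in distribution only; on a given sample the two values differ in general.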


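Worked Example 2.1.25 Suppose $\alpha \ge 1$ and $X_\alpha$ is a positive real-valued
RV with PDF

$$f_\alpha(t) = A_\alpha t^{\alpha - 1}\exp(-t^\alpha)$$

for $t > 0$, where $A_\alpha$ is a constant. Find $A_\alpha$ and show that, if $\alpha > 1$ and $s, t > 0$,

$$\mathbb{P}(X_\alpha \ge s + t \mid X_\alpha \ge t) < \mathbb{P}(X_\alpha \ge s).$$

What is the corresponding relation for $\alpha = 1$?

Solution We must have $\int_0^\infty f_\alpha(t)\,dt = 1$, so

$$A_\alpha^{-1} = \int_0^\infty t^{\alpha - 1}\exp(-t^\alpha)\,dt
= \alpha^{-1}\int_0^\infty \exp(-t^\alpha)\,d(t^\alpha) = \alpha^{-1}\left[-e^{-t^\alpha}\right]_0^\infty = \alpha^{-1},$$

and $A_\alpha = \alpha$. If $\alpha > 1$, then, for all $s, t > 0$,

$$\mathbb{P}(X_\alpha \ge s + t \mid X_\alpha \ge t) = \frac{\mathbb{P}(X_\alpha \ge s + t)}{\mathbb{P}(X_\alpha \ge t)}
= \frac{\int_{s+t}^\infty \exp(-u^\alpha)\,d(u^\alpha)}{\int_t^\infty \exp(-u^\alpha)\,d(u^\alpha)}
= \frac{\exp(-(s + t)^\alpha)}{\exp(-t^\alpha)}$$

$$= \exp\big(t^\alpha - (s + t)^\alpha\big) = \exp(-s^\alpha + \text{negative terms}) < \exp(-s^\alpha) = \mathbb{P}(X_\alpha \ge s).$$

If $\alpha = 1$, $t^\alpha - (s + t)^\alpha = -s$, and the above inequality becomes an equality as
$\mathbb{P}(X_\alpha \ge t) = \exp(-t)$. (This is the memoryless property of the exponential
distribution.)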
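
The inequality of Worked Example 2.1.25 is easy to tabulate. The sketch below is illustrative, with function names introduced only here: it uses the tail $\mathbb{P}(X_\alpha \ge t) = \exp(-t^\alpha)$ obtained in the solution, and also checks the normalising constant $A_\alpha = \alpha$ by a midpoint-rule integration of the PDF over a long finite interval.

```python
import math

def survival(alpha, t):
    """P(X_alpha >= t) = exp(-t^alpha) for t >= 0."""
    return math.exp(-t ** alpha)

def conditional_survival(alpha, s, t):
    """P(X_alpha >= s + t | X_alpha >= t)."""
    return survival(alpha, s + t) / survival(alpha, t)

def pdf_mass(alpha, upper=10.0, steps=200_000):
    """Midpoint-rule integral of alpha * t^(alpha-1) * exp(-t^alpha) over [0, upper]."""
    h = upper / steps
    return h * sum(alpha * ((i + 0.5) * h) ** (alpha - 1) * math.exp(-((i + 0.5) * h) ** alpha)
                   for i in range(steps))

if __name__ == "__main__":
    for alpha in (1.0, 1.5, 2.0):
        print(alpha, conditional_survival(alpha, 1.0, 0.5), survival(alpha, 1.0), pdf_mass(alpha))
```

For $\alpha = 1$ the first two printed values coincide; for $\alpha > 1$ the conditional survival probability is strictly smaller.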


Remark 2.1.26 If you interpret $X_\alpha$ as the lifetime of a certain device
(e.g. a light bulb), then the inequality $\mathbb{P}(X_\alpha \ge s + t \mid X_\alpha \ge t) < \mathbb{P}(X_\alpha \ge s)$
emphasises an ‘ageing’ phenomenon, where an old device that had been in 
use for time t is less likely to serve for duration s than a new one. There 
are examples where the inequality is reversed: the quality of a device (or an 
individual) improves in the course of service. 


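Worked Example 2.1.27 Let a point in the plane have Cartesian
co-ordinates $(X, Y)$ and polar co-ordinates $(R, \Theta)$. If $X$ and $Y$ are independent,
identically distributed (IID) RVs, each having a normal distribution with
mean zero, show that $R^2$ has an exponential distribution and is independent
of $\Theta$.

Solution The joint PDF $f_{X,Y}$ is $(1/2\pi\sigma^2)e^{-(x^2+y^2)/2\sigma^2}$ and $R$ and $\Theta$ are
defined by

$$R^2 = T = X^2 + Y^2,\qquad \Theta = \tan^{-1}\frac{Y}{X}.$$

Then

$$f_{R^2,\Theta}(t, \theta) = f_{X,Y}(x(t, \theta), y(t, \theta))\,I(t > 0)\,I(0 < \theta < 2\pi)\left|\frac{\partial(x, y)}{\partial(t, \theta)}\right|,$$

where the inverse Jacobian

$$\frac{\partial(x, y)}{\partial(t, \theta)} = \left(\frac{\partial(t, \theta)}{\partial(x, y)}\right)^{-1}
= \left[\det\begin{pmatrix} 2x & 2y \\ -y/(x^2 + y^2) & x/(x^2 + y^2) \end{pmatrix}\right]^{-1} = \frac{1}{2}.$$

Hence

$$f_{R^2,\Theta}(t, \theta) = \frac{1}{4\pi\sigma^2}e^{-t/2\sigma^2}I(t > 0)\,I(0 < \theta < 2\pi) = f_{R^2}(t) f_\Theta(\theta),$$

with

$$f_{R^2}(t) = \frac{1}{2\sigma^2}e^{-t/2\sigma^2}I(t > 0),\qquad f_\Theta(\theta) = \frac{1}{2\pi}I(0 < \theta < 2\pi).$$

Thus, $R^2 \sim \mathrm{Exp}(1/2\sigma^2)$, $\Theta \sim \mathrm{U}(0, 2\pi)$ and $R^2$ and $\Theta$ are independent.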


Worked Example 2.1.28 Planet Zog is a sphere with centre $O$. Three
spaceships $A$, $B$ and $C$ land at random on its surface, their positions being
independent and each uniformly distributed on its surface. Calculate the
probability density function of the angle $\angle AOB$ formed by the lines $OA$
and $OB$.

Spaceships $A$ and $B$ can communicate directly by radio if $\angle AOB < \pi/2$,
and similarly for spaceships $B$ and $C$ and spaceships $A$ and $C$. Given
angle $\angle AOB = \gamma < \pi/2$, calculate the probability that $C$ can communicate
directly with either $A$ or $B$. Given angle $\angle AOB = \gamma > \pi/2$, calculate the
probability that $C$ can communicate directly with both $A$ and $B$. Hence, or
otherwise, show that the probability that all three spaceships can keep in
touch (with, for example, $A$ communicating with $B$ via $C$ if necessary) is
$(\pi + 2)/(4\pi)$.

Solution Observe that

$$\mathbb{P}(\angle AOB \in (\gamma, \gamma + \delta\gamma)) \sim \delta\gamma\,\sin\gamma.$$

As $\int_0^\pi \sin\gamma\,d\gamma = 2$, the PDF of $\angle AOB$ is $f(\gamma) = \frac{1}{2}\sin\gamma$. Two cases are
possible depending on the value of $\angle AOB$. If $\angle AOB = \gamma < \frac{\pi}{2}$, then the
probability that $C$ can communicate directly with either $A$ or $B$ is $\frac{\pi + \gamma}{2\pi}$. If
$\angle AOB = \gamma > \frac{\pi}{2}$, then the probability that $C$ can communicate directly with
both $A$ and $B$ is $\frac{\pi - \gamma}{2\pi}$. In fact, let the spacecraft $C$ be in between $A$ and $B$.
The shadow zone where $C$ could communicate with $A$ but not with $B$ is an
angle $\gamma - \pi/2$. The shadow zone where $C$ could communicate with $B$ but not
with $A$ is also an angle $\gamma - \pi/2$. So, the admissible zone (i.e. visibility zone)
is an angle $\gamma - 2(\gamma - \pi/2) = \pi - \gamma$. This is shown in Figure 2.17.

[Figure 2.17: the two shadow zones of angle $\gamma - \pi/2$ and the visibility zone of angle $\gamma - 2(\gamma - \pi/2) = \pi - \gamma$.]

Hence,

$$\mathbb{P}(\text{all three spaceships can keep in touch})
= \int_0^{\pi/2}\frac{\pi + \gamma}{2\pi}\,\frac{1}{2}\sin\gamma\,d\gamma + \int_{\pi/2}^{\pi}\frac{\pi - \gamma}{2\pi}\,\frac{1}{2}\sin\gamma\,d\gamma$$

$$= \frac{1}{4}\int_0^\pi \sin\gamma\,d\gamma + \frac{1}{4\pi}\int_0^{\pi/2}\gamma\sin\gamma\,d\gamma - \frac{1}{4\pi}\int_{\pi/2}^{\pi}\gamma\sin\gamma\,d\gamma
= \frac{1}{2} + \frac{1}{4\pi} - \frac{\pi - 1}{4\pi} = \frac{\pi + 2}{4\pi}.$$
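
The final integral in Worked Example 2.1.28 can be verified numerically. The sketch below is an illustration only (the function name and the midpoint-rule integration are choices made here): it integrates the conditional probabilities $\frac{\pi + \gamma}{2\pi}$ and $\frac{\pi - \gamma}{2\pi}$ against the density $\frac{1}{2}\sin\gamma$ and compares the result with $(\pi + 2)/(4\pi) \approx 0.409$.

```python
import math

def keep_in_touch_probability(steps=100_000):
    """Midpoint-rule value of the integral giving P(all three spaceships keep in touch)."""
    h = math.pi / steps
    total = 0.0
    for i in range(steps):
        g = (i + 0.5) * h                  # angle AOB
        density = 0.5 * math.sin(g)        # PDF of the angle AOB
        if g < math.pi / 2:
            cond = (math.pi + g) / (2 * math.pi)   # C reaches A or B
        else:
            cond = (math.pi - g) / (2 * math.pi)   # C reaches both A and B
        total += cond * density
    return total * h

if __name__ == "__main__":
    print(keep_in_touch_probability(), (math.pi + 2) / (4 * math.pi))
```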


Worked Example 2.1.29 (a) Let $A_1, A_2, \ldots, A_n$ be events such that
$A_i \cap A_j = \emptyset$ for $i \ne j$. Show that the number $N$ of events that occur satisfies

$$\mathbb{P}(N = 0) = 1 - \sum_{i=1}^n \mathbb{P}(A_i).$$

(b) Planet Zog is a sphere with centre $O$. A number $N$ of spaceships land
at random on its surface, their positions being independent, each uniformly
distributed over the surface. A spaceship at $A$ is in direct radio contact with
another point $B$ on the surface if $\angle AOB < \pi/2$. Calculate the probability
that every point on the surface of the planet is in direct radio contact with
at least one of the $N$ spaceships.

Hint: The intersection of the surface of a sphere with a plane through the
centre of the sphere is called a great circle. You may find it helpful to use
the fact that $N$ random great circles partition the surface of a sphere into
$N(N - 1) + 2$ disjoint regions with probability 1.

Solution (a) We have

$$\mathbb{P}(N = 0) = \mathbb{P}(A_1^c \cap \cdots \cap A_n^c) = 1 - \mathbb{P}\Big(\bigcup_{i=1}^n A_i\Big) = 1 - \sum_{i=1}^n \mathbb{P}(A_i).$$

(b) The position of a spacecraft determines its zone of radio-availability.
The boundary of a zone is a great circle on the sphere. For a given great circle
the spacecraft may be in either of the two hemispheres with equal probabilities.
Let $N$ great circles be given. Denote by $A_1, \ldots, A_{f(N)}$, where

$$f(N) = N(N - 1) + 2,$$

all domains on the surface of the sphere created by these great circles. Define
an event

$$E_i = \{\text{points of domain } A_i \text{ have no connection with any of the spacecraft}\}.$$

Then $\mathbb{P}(E_i) = 1/2^N$. Hence,

$$\mathbb{P}(\text{each point on the surface has a connection with some spacecraft})
= 1 - \mathbb{P}\Big(\bigcup_{i=1}^{f(N)} E_i\Big) = 1 - \sum_{i=1}^{f(N)} \mathbb{P}(E_i)
= 1 - f(N)\frac{1}{2^N} = 1 - \frac{N(N - 1) + 2}{2^N},$$

as $E_i \cap E_j = \emptyset$ for $i \ne j$. Observe that $A_i$ and $A_j$ lie on different sides of at
least one great circle.

Now we give a short proof of the statement in the hint. Generically, $N$
great circles split the surface of the sphere into $f(N)$ non-intersecting
domains. Add one more great circle. Generically, it intersects any of the existing
circles in two points. Any such intersection splits an existing domain into
two. So,

$$f(N + 1) = f(N) + 2N,\qquad f(1) = 2.$$

This recurrence relation implies

$$f(N) = 2 + 2 + 4 + \cdots + 2(N - 1) = 2 + N(N - 1).$$

Here the term ‘generically’ means that the property holds with
probability 1.


Worked Example 2.1.30 Points $X_1, X_2, \ldots$ are thrown in a plane
independently of each other, with the same probability distribution. This
distribution is unknown, the only available fact being that the probability that
any given three random points $X_i$, $X_j$ and $X_k$, with $1 \le i < j < k$, lie on a
line equals 0. Prove the following inequalities.

(a) The probability $p^{(3)}$ that one of the angles in the triangle with vertices
at $X_1$, $X_2$ and $X_3$ is at least $2\pi/3$ is greater than or equal to 1/20.

(b) The probability $p^{(4)}$ that points $X_1$, $X_2$, $X_3$ and $X_4$ form a convex
quadrilateral is at least 1/5.

Hint: An elegant argument is to consider more points. For instance, in
part (a) you may take six random points $X_1, \ldots, X_6$ and check that

$$\mathbb{E}N = \binom{6}{3}p^{(3)},$$

where $N$ is the random number of triangles with an angle $\ge 2\pi/3$. The
problem is then reduced to proving that $\mathbb{E}N \ge 1$. You can analyse two cases:
(i) the convex hull spanned by $X_1, \ldots, X_6$ is either a triangle, a quadrilateral
or a pentagon; and (ii) it is a hexagon. In the former case, at least one of
the points lies inside a triangle formed by the other three.

Similarly, in part (b) you may take five points $X_1, \ldots, X_5$ and consider
the random number $M$ of convex quadrilaterals among the emerging five.
Again, it may be helpful to analyse the convex hull spanned by $X_1, \ldots, X_5$.


Solution (a) Following the hint, consider six IID random points $X_1, \ldots, X_6$;
we know that with probability 1 no three points among them lie on a line.
Denote by $p^{(3)}_{ijk}$ the probability that the triangle $X_i, X_j, X_k$ has an angle
$\ge 2\pi/3$. With $N$ as in the hint, we have that

$$\mathbb{E}N = \sum_{1\le i<j<k\le 6} p^{(3)}_{ijk} = \binom{6}{3}p^{(3)},\quad\text{with } \binom{6}{3} = 20.$$

Hence, it suffices to prove that $\mathbb{E}N \ge 1$.

In fact, a stronger property holds true: $N \ge 1$ with probability 1. Indeed,
the convex hull of points $X_1, \ldots, X_6$ is either (i) a triangle, (ii) a
quadrilateral, (iii) a pentagon or (iv) a hexagon. In cases (i)–(iii) we have a point
among $X_1, \ldots, X_6$ which lies inside a triangle formed by the other three
points. If, say, $X_4$ lies inside the triangle formed by $X_1$, $X_2$ and $X_3$, the
sum of the angles $X_1X_4X_2$, $X_2X_4X_3$ and $X_3X_4X_1$ equals $2\pi$, and thus at
least one of them is $\ge 2\pi/3$.

In the remaining case (iv), the sum of the angles of the hexagon equals
$4\pi$ and hence at least one of them is again $\ge 2\pi/3$.

(b) Similarly,

$$\mathbb{E}M = \binom{5}{4}p^{(4)},\quad\text{where } \binom{5}{4} = 5.$$

We again need $\mathbb{E}M \ge 1$ and will check a stronger fact, that $M \ge 1$ with
probability 1.

Indeed, since no three points lie on a line, we have that points $X_1, \ldots, X_5$
span either: (i) a convex pentagon, (ii) a convex quadrilateral, or (iii) a
triangle.

In cases (i) and (ii), $M \ge 1$, obviously.

In case (iii), two of the points lie inside a triangle formed by the other
three. Suppose that, say, $X_4$ and $X_5$ fall inside the triangle $X_1, X_2, X_3$.
Then, given the position of $X_4$, the triangle is partitioned into six domains
(see Figure 2.18), and it is evident that, wherever point $X_5$ falls, we will find
a convex quadrilateral. Hence, as before, $M \ge 1$.


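Worked Example 2.1.31 (a) Let $X$ and $Y$ be two independent uniformly
distributed RVs on [0, 1]. Prove that $\mathbb{E}X^k = \dfrac{1}{k + 1}$ and $\mathbb{E}(XY)^k = \dfrac{1}{(k + 1)^2}$
and find the expected value $\mathbb{E}(1 - XY)^k$, where $k$ is a non-negative integer.

(b) Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be $n$ independent random points of a unit
square $S = \{(x, y): 0 \le x, y \le 1\}$. We say that $(X_i, Y_i)$ is a maximal
external point if, for each $j = 1, \ldots, n$, either $X_j \le X_i$ or $Y_j \le Y_i$. (For
example, in Figure 2.19 there are three maximal external points.) Determine
the expected number of maximal external points.

[Figure 2.18]

[Figure 2.19]

Solution (a) We have

$$\mathbb{E}X^k = \int_0^1 x^k\,dx = \frac{1}{k + 1}.$$

Further,

$$\mathbb{E}(XY)^k = (\mathbb{E}X^k)(\mathbb{E}Y^k) = \frac{1}{(k + 1)^2}\quad\text{(by independence)}.$$

Next, expanding the binomial, for all $k = 0, 1, \ldots$,

$$\mathbb{E}(1 - XY)^k = \sum_{0\le i\le k}\binom{k}{i}(-1)^i\,\mathbb{E}(XY)^i = \sum_{0\le i\le k}\binom{k}{i}\frac{(-1)^i}{(i + 1)^2}.$$

This expression may be simplified as follows:

$$\sum_{i=0}^k \binom{k}{i}\frac{(-1)^i}{(i + 1)^2}
= \frac{1}{k + 1}\sum_{i=0}^k \binom{k + 1}{i + 1}\frac{(-1)^i}{i + 1}
= \frac{1}{k + 1}\sum_{i=0}^k \binom{k + 1}{i + 1}(-1)^i\int_0^1 x^i\,dx$$

$$= \frac{1}{k + 1}\int_0^1 \frac{1 - (1 - x)^{k+1}}{x}\,dx = \frac{1}{k + 1}\int_0^1 \frac{1 - x^{k+1}}{1 - x}\,dx
= \frac{1}{k + 1}\int_0^1 (1 + x + \cdots + x^k)\,dx = \frac{1}{k + 1}\sum_{1\le i\le k+1}\frac{1}{i}.$$

(b) Let $B$ denote the number of maximal external points (below called border points). By linearity and
symmetry,

$$B = \sum_{1\le i\le n} \mathbf{1}(i \text{ a border point}),\quad\text{whence } \mathbb{E}B = n\mathbb{P}(1 \text{ a border point}).$$

Next, by conditioning and uniformity,

$$n\mathbb{P}(1 \text{ a border point}) = n\mathbb{E}\big[\mathbb{E}\big(\mathbf{1}(1 \text{ a border point})\,\big|\,(X_1, Y_1)\big)\big]
= n\mathbb{E}\big[\mathbb{P}\big(\text{no point with } X_j > X_1 \text{ and } Y_j > Y_1\,\big|\,(X_1, Y_1)\big)\big]$$

$$= n\mathbb{E}\big[1 - (1 - X_1)(1 - Y_1)\big]^{n-1} = n\mathbb{E}(1 - X_1Y_1)^{n-1}.$$

Now, by using the final part of (a), with $k = n - 1$,

$$n\mathbb{E}(1 - X_1Y_1)^{n-1} = n\sum_{0\le i\le n-1}\binom{n - 1}{i}\frac{(-1)^i}{(i + 1)^2}.$$

By the previous remark, the answer is $\mathbb{E}B = \sum_{i=1}^n \dfrac{1}{i}$.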
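
The simplification used in part (a) can be confirmed in exact rational arithmetic. The sketch below is an illustration only, with function names introduced here: it checks $\sum_{i=0}^k \binom{k}{i}(-1)^i/(i + 1)^2 = \frac{1}{k+1}\sum_{i=1}^{k+1} 1/i$ for small $k$ and evaluates the expected number of maximal external points $n\,\mathbb{E}(1 - X_1Y_1)^{n-1}$.

```python
from fractions import Fraction
from math import comb

def alternating_sum(k):
    """Exact value of sum_{i=0}^{k} C(k, i) * (-1)^i / (i + 1)^2, i.e. E(1 - XY)^k."""
    return sum(Fraction((-1) ** i * comb(k, i), (i + 1) ** 2) for i in range(k + 1))

def harmonic(n):
    """Exact value of 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, i) for i in range(1, n + 1))

def expected_maximal_points(n):
    """E B = n * E(1 - X_1 Y_1)^(n - 1) for n uniform points in the unit square."""
    return n * alternating_sum(n - 1)

if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(n, expected_maximal_points(n), harmonic(n))
```

The two printed columns coincide, so the expected number of maximal external points grows only logarithmically in $n$.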


Problem 2.1.32 Random variables $X$ and $Y$ are independent and
exponentially distributed with parameters $\lambda$ and $\mu$. Set

$$U = \max[X, Y],\qquad V = \min[X, Y].$$

Are $U$ and $V$ independent? Is RV $U$ independent of the event $\{X > Y\}$ (i.e.
of the RV $I(X > Y)$)?

Hint: Check that $\mathbb{P}(V > y_1, U < y_2) \ne \mathbb{P}(V > y_1)\mathbb{P}(U < y_2)$ and $\mathbb{P}(X >
Y, U < y) \ne \mathbb{P}(X > Y)\mathbb{P}(U < y)$.

Problem 2.1.33 The random variables $X$ and $Y$ are independent and
exponentially distributed, each with parameter $\lambda$. Show that the RVs $X + Y$
and $X/(X + Y)$ are independent and find their distributions.

Hint: Check that

$$f_{U,V}(u, v) = \lambda^2 ue^{-\lambda u} I(u > 0)\,I(0 < v < 1) = f_U(u) f_V(v).$$

Problem 2.1.34 (i) $X$ and $Y$ are independent RVs, with continuous
symmetric distributions, with PDFs $f_X$ and $f_Y$, respectively. Show that the PDF
of $Z = X/Y$ is

$$h(a) = 2\int_0^\infty y f_X(ay) f_Y(y)\,dy.$$

(ii) $X$ and $Y$ are independent normal RVs distributed $\mathrm{N}(0, \sigma^2)$ and
$\mathrm{N}(0, \tau^2)$. Show that $Z = X/Y$ has PDF $h(a) = d/[\pi(d^2 + a^2)]$, where $d = \sigma/\tau$
(the PDF of the Cauchy distribution).

Hint:

$$F_Z(a) = 2\int_0^\infty\int_{-\infty}^{ay} f_X(x) f_Y(y)\,dx\,dy.$$


2.2 Expectation, conditional expectation, variance, 
generating function, characteristic function 


Tales of the Expected Value. 
(From the series ‘Movies that never made it to the Big Screen’.) 


They usually have difficult or threatening names such as 
Bernoulli, De Moivre-Laplace, Chebyshev, Poisson. 
Where are the probabilists with names such as Smith, 
Brown, or Johnson? 

(From the series ‘Why they are misunderstood’. ) 


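The mean and the variance of an RV $X$ with PDF $f_X$ are calculated similarly
to the discrete-value case. Namely,

$$\mathbb{E}X = \int x f_X(x)\,dx\quad\text{and}\quad \mathrm{Var}\,X = \int (x - \mathbb{E}X)^2 f_X(x)\,dx. \qquad (2.2.1)$$

Clearly,

$$\mathrm{Var}\,X = \int \big(x^2 - 2x\,\mathbb{E}X + (\mathbb{E}X)^2\big) f_X(x)\,dx
= \int x^2 f_X(x)\,dx - 2\,\mathbb{E}X\int x f_X(x)\,dx + (\mathbb{E}X)^2\int f_X(x)\,dx$$

$$= \mathbb{E}X^2 - 2(\mathbb{E}X)^2 + (\mathbb{E}X)^2 = \mathbb{E}X^2 - (\mathbb{E}X)^2. \qquad (2.2.2)$$

Of course, these formulas make sense when the integrals exist; in the case
of the expectation we require that

$$\int |x| f_X(x)\,dx < \infty.$$

Otherwise, i.e. if $\int |x| f_X(x)\,dx = \infty$, it is said that $X$ does not have a finite
expectation (or $\mathbb{E}X$ does not exist). Similarly, with $\mathrm{Var}\,X$.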


Remark 2.2.1 Sometimes one allows a further classification,
regarding the contributions to integral $\int x f_X(x)\,dx$ from $\int_0^\infty$ and $\int_{-\infty}^0$. Indeed,
$\int_0^\infty x f_X(x)\,dx$ corresponds to $\mathbb{E}X_+$, while $\int_{-\infty}^0 (-x) f_X(x)\,dx$ to $\mathbb{E}X_-$, where
$X_+ = \max[0, X]$ and $X_- = -\min[0, X]$. Dealing with integrals $\int_0^\infty$ and
$\int_{-\infty}^0$ is simpler as value $x$ keeps its sign on each of the intervals $(0, \infty)$
and $(-\infty, 0)$. Then one says that $\mathbb{E}X = \infty$ if $\int_0^\infty x f_X(x)\,dx = \infty$ and
$\int_{-\infty}^0 (-x) f_X(x)\,dx < \infty$. Similarly, $\mathbb{E}X = -\infty$ if $\int_0^\infty x f_X(x)\,dx < \infty$ but
$\int_{-\infty}^0 (-x) f_X(x)\,dx = \infty$.

Formulas (2.2.1) and (2.2.2) are in agreement with the standard
representation of RVs, where $X(x) = x$ (or $X(\omega) = \omega$). Indeed, we can say that in
$\mathbb{E}X$ we integrate values $X(x)$ and in $\mathrm{Var}\,X$ values $(X(x) - \mathbb{E}X)^2$ against the
PDF $f_X(x)$, which makes strong analogies with the discrete case formulas
$\mathbb{E}X = \sum_\omega X(\omega)p_X(\omega)$ and $\mathrm{Var}\,X = \sum_\omega (X(\omega) - \mathbb{E}X)^2 p_X(\omega)$.


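Example 2.2.2 For $X \sim \mathrm{U}(a, b)$, the mean and the variance are

$$\mathbb{E}X = \frac{b + a}{2},\qquad \mathrm{Var}\,X = \frac{(b - a)^2}{12}. \qquad (2.2.3)$$

In fact,

$$\mathbb{E}X = \frac{1}{b - a}\int_a^b x\,dx = \frac{1}{b - a}\left(\frac{x^2}{2}\right)\Big|_a^b = \frac{1}{2}\,\frac{b^2 - a^2}{b - a} = \frac{b + a}{2},$$

the middle point of $(a, b)$. Further,

$$\mathbb{E}X^2 = \frac{1}{b - a}\int_a^b x^2\,dx = \frac{b^3 - a^3}{3(b - a)} = \frac{1}{3}(b^2 + ab + a^2).$$

Hence, $\mathrm{Var}\,X = \frac{1}{3}(b^2 + ab + a^2) - \frac{1}{4}(b + a)^2 = \frac{1}{12}(b - a)^2$.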


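Example 2.2.3 For $X \sim \mathrm{N}(\mu, \sigma^2)$:

$$\mathbb{E}X = \mu,\qquad \mathrm{Var}\,X = \sigma^2. \qquad (2.2.4)$$

In fact,

$$\mathbb{E}X = \frac{1}{\sqrt{2\pi}\sigma}\int x\exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]dx
= \frac{1}{\sqrt{2\pi}\sigma}\int (x - \mu + \mu)\exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]dx$$

$$= \frac{1}{\sqrt{2\pi}\sigma}\left[\int x\exp\left(-\frac{x^2}{2\sigma^2}\right)dx + \mu\int \exp\left(-\frac{x^2}{2\sigma^2}\right)dx\right]
= 0 + \frac{\mu}{\sqrt{2\pi}}\int \exp\left(-\frac{x^2}{2}\right)dx = \mu,$$

and

$$\mathrm{Var}\,X = \frac{1}{\sqrt{2\pi}\sigma}\int (x - \mu)^2\exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]dx
= \frac{\sigma^2}{\sqrt{2\pi}}\int x^2\exp\left(-\frac{x^2}{2}\right)dx = \sigma^2.$$

Example 2.2.4 For $X \sim \mathrm{Exp}(\lambda)$,

$$\mathbb{E}X = \frac{1}{\lambda},\qquad \mathrm{Var}\,X = \frac{1}{\lambda^2}. \qquad (2.2.5)$$

In fact,

$$\mathbb{E}X = \lambda\int_0^\infty xe^{-\lambda x}\,dx = \frac{1}{\lambda}\int_0^\infty xe^{-x}\,dx = \frac{1}{\lambda}$$

and

$$\mathbb{E}X^2 = \lambda\int_0^\infty x^2e^{-\lambda x}\,dx = \frac{1}{\lambda^2}\int_0^\infty x^2e^{-x}\,dx = \frac{2}{\lambda^2},$$

implying that $\mathrm{Var}\,X = 2/\lambda^2 - 1/\lambda^2 = 1/\lambda^2$.

Example 2.2.5 For $X \sim \mathrm{Gam}(\alpha, \lambda)$,

$$\mathbb{E}X = \frac{\alpha}{\lambda},\qquad \mathrm{Var}\,X = \frac{\alpha}{\lambda^2}. \qquad (2.2.6)$$

In fact,

$$\mathbb{E}X = \frac{\lambda^\alpha}{\Gamma(\alpha)}\int_0^\infty x\,x^{\alpha - 1}e^{-\lambda x}\,dx = \frac{1}{\lambda\Gamma(\alpha)}\int_0^\infty x^\alpha e^{-x}\,dx
= \frac{\Gamma(\alpha + 1)}{\lambda\Gamma(\alpha)} = \frac{\alpha}{\lambda}.$$

Next,

$$\mathbb{E}X^2 = \frac{\lambda^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{\alpha + 1}e^{-\lambda x}\,dx = \frac{\Gamma(\alpha + 2)}{\lambda^2\Gamma(\alpha)} = \frac{(\alpha + 1)\alpha}{\lambda^2}.$$

This gives $\mathrm{Var}\,X = (\alpha + 1)\alpha/\lambda^2 - \alpha^2/\lambda^2 = \alpha/\lambda^2$.

Example 2.2.6 Finally, for $X \sim \mathrm{Ca}(\alpha, \tau)$, the integral

$$\int \frac{\tau|x|}{\pi(\tau^2 + (x - \alpha)^2)}\,dx = \infty,$$

which means that $\mathbb{E}X$ does not exist (let alone $\mathrm{Var}\,X$).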


A table of several probability distributions is given in the Appendix. 

In a general situation where the distribution of $X$ has a discrete and
an absolutely continuous component, formulas (2.2.1), (2.2.2) have to be
modified. For example, for the RV $W$ from equation (2.1.26):

$$\mathbb{E}W = [0 \cdot \mathbb{P}(W = 0)] + \frac{\lambda}{\mu}\int_0^\infty y(\mu - \lambda)e^{-(\mu - \lambda)y}\,dy = \frac{\lambda}{\mu(\mu - \lambda)}. \qquad (2.2.7)$$
An important property of expectation is additivity: 
$$\mathbb{E}(X+Y) = \mathbb{E}X + \mathbb{E}Y. \tag{2.2.8}$$


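To check this fact, we use the joint PDF $f_{X,Y}$:

$$\begin{aligned}
\mathbb{E}X + \mathbb{E}Y &= \int x f_X(x)\,dx + \int y f_Y(y)\,dy \\
&= \int x \int f_{X,Y}(x,y)\,dy\,dx + \int y \int f_{X,Y}(x,y)\,dx\,dy \\
&= \iint (x+y) f_{X,Y}(x,y)\,dy\,dx = \mathbb{E}(X+Y).
\end{aligned}$$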


As in the discrete-value case, we want to stress that formula (2.2.8) is a 
completely general property that holds for any RVs X, Y. It has been derived 
here when X and Y have a joint PDF (and in Section 1.4 for discrete RVs), 
but the additivity holds in the general situation, regardless of whether X and 
Y are discrete or have marginal or joint PDFs. The proof in its full generality 
requires Lebesgue integration and will be addressed in a later chapter. 

Another property is that if c is a constant then 


E(cX) = cEX. (2.2.9) 


Again, in the case of an RV with a PDF, the proof is easy: for $c \neq 0$:

$$\mathbb{E}(cX) = \int x f_{cX}(x)\,dx = \int x\,\frac{1}{|c|}\,f_X\!\left(\frac{x}{c}\right)dx = c\int x f_X(x)\,dx;$$

when $c > 0$ the last equality is straightforward and when $c < 0$ one has to change the limits of integration, which leads to the result. Combining equations (2.2.8) and (2.2.9), we obtain the linearity of expectation:

$$\mathbb{E}(c_1 X + c_2 Y) = c_1\mathbb{E}X + c_2\mathbb{E}Y. \tag{2.2.10}$$

It also holds for any finite or countable collection of RVs:

$$\mathbb{E}\left(\sum_i c_i X_i\right) = \sum_i c_i\,\mathbb{E}X_i, \tag{2.2.11}$$

provided that each mean value $\mathbb{E}X_i$ exists and the series $\sum_i |c_i \mathbb{E}X_i|$ converges. In particular, for RVs $X_1, X_2, \ldots$ with $\mathbb{E}X_i = \mu$, $\mathbb{E}\left(\sum_{i=1}^n X_i\right) = n\mu$.

A convenient (and often used) formula is the Law of the Unconscious Statistician:

$$\mathbb{E}g(X) = \int g(x) f_X(x)\,dx, \tag{2.2.12}$$

relating the mean value of $Y = g(X)$, a function of an RV $X$, with the PDF of $X$. It holds whenever the integral on the RHS exists, i.e. $\int |g(x)| f_X(x)\,dx < \infty$.
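As a quick numerical illustration of formula (2.2.12), the sketch below compares two ways of computing $\mathbb{E}g(X)$ in one toy case chosen here for convenience (it is not taken from the text): $X \sim \mathrm{U}(0,1)$ and $g(x) = x^2$, so that $Y = X^2$ has PDF $1/(2\sqrt{y})$ on $(0,1)$. The helper `midpoint_integral` is a hypothetical name for a plain midpoint rule.

```python
import math


def midpoint_integral(func, lo, hi, steps=100_000):
    """Approximate the integral of func over (lo, hi) by the midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (k + 0.5) * h) for k in range(steps))


def mean_via_pdf_of_x():
    # Right-hand side of (2.2.12): integrate g(x) * f_X(x), with f_X = 1 on (0, 1).
    return midpoint_integral(lambda x: x * x, 0.0, 1.0)


def mean_via_pdf_of_y():
    # Direct definition of EY: Y = X**2 has PDF 1 / (2 * sqrt(y)) on (0, 1).
    return midpoint_integral(lambda y: y / (2.0 * math.sqrt(y)), 0.0, 1.0)


# Both routes should agree with the exact value 1/3.
print(mean_via_pdf_of_x(), mean_via_pdf_of_y())
```

The point of the comparison is that the first route never needs the PDF of $Y$, which is exactly what makes (2.2.12) convenient.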

The full proof of formula (2.2.12) again requires the use of Lebesgue integration, but its basics are straightforward. To start with, assume that function $g$ is differentiable and $g' > 0$ [so $g$ is monotone increasing and hence invertible: for all $y$ in the range of $g$ there exists a unique $x = x(y)\ (= g^{-1}(y)) \in \mathbb{R}$ with $g(x) = y$]. Owing to equation (2.1.28),

$$\mathbb{E}g(X) = \int_{\mathrm{Range}(g)} y f_{g(X)}(y)\,dy = \int_{\mathrm{Range}(g)} y f_X(x(y))\,\frac{1}{g'(x(y))}\,dy.$$

The change of variables $y = g(x)$, with $x(g(x)) = x$, yields that the last integral is equal to

$$\int g(x) f_X(x)\,\frac{1}{g'(x)}\,g'(x)\,dx = \int g(x) f_X(x)\,dx,$$

i.e. the RHS of formula (2.2.12). A similar idea works when $g' < 0$: the minus sign is compensated by the inversion of the integration limits $-\infty$ and $\infty$.
The spirit of this argument is preserved in the case where g is continuously 
differentiable and the derivative g’ has a discrete set of zeroes (i.e. finite or 
countable, but without accumulation points on R). This condition includes, 
for example, all polynomials. Then we can divide the line R into intervals 
where g’ has the same sign and repeat the argument locally, on each such 
interval. Summing over these intervals would again give formula (2.2.12).
Another instructive (and simple) observation is that formula (2.2.12) holds 
for any locally constant function $g$, taking finitely or countably many values $y_j$: here

$$\mathbb{E}g(X) = \sum_j y_j\,\mathbb{P}(g(X) = y_j) = \sum_j y_j \int f_X(x)\,I(x: g(x) = y_j)\,dx = \int g(x) f_X(x)\,dx.$$


By using linearity, it is possible to extend formula (2.2.12) to the class 
of functions g continuously differentiable on each of (finitely or countably 
many) intervals partitioning R, viz. 


$$g(x) = \begin{cases} 0, & x \le 0, \\ x, & x \ge 0. \end{cases}$$


This class will cover all applications considered in this volume. 
Examples of using formula (2.2.12) are the equations 


$$\mathbb{E}\,I(a < X < b) = \int_a^b f_X(x)\,dx = \mathbb{P}(a < X < b)$$

and

$$\mathbb{E}\,X I(a < X < b) = \int_a^b x f_X(x)\,dx.$$


A similar formula holds for the expectation Eg(X,Y ), expressing it in 
terms of a two-dimensional integral against joint PDF fx y: 


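$$\mathbb{E}g(X,Y) = \iint g(x,y) f_{X,Y}(x,y)\,dx\,dy. \tag{2.2.13}$$

Here, important examples of function $g$ are: (i) the sum $x + y$, with

$$\mathbb{E}(X+Y) = \iint (x+y) f_{X,Y}(x,y)\,dx\,dy = \int x f_X(x)\,dx + \int y f_Y(y)\,dy = \mathbb{E}X + \mathbb{E}Y$$

(cf. equation (2.2.8)), and (ii) the product $g(x,y) = xy$, where

$$\mathbb{E}(XY) = \iint xy f_{X,Y}(x,y)\,dx\,dy. \tag{2.2.14}$$

Note that the Cauchy-Schwarz (CS) inequality (1.4.18) holds:

$$|\mathbb{E}(XY)| \le \left(\mathbb{E}X^2\,\mathbb{E}Y^2\right)^{1/2},$$

as its proof in Section 1.4 uses only linearity of the expectation.

The covariance $\mathrm{Cov}(X,Y)$ of RVs $X$ and $Y$ is defined by

$$\mathrm{Cov}(X,Y) = \iint (x - \mathbb{E}X)(y - \mathbb{E}Y) f_{X,Y}(x,y)\,dy\,dx = \mathbb{E}(X - \mathbb{E}X)(Y - \mathbb{E}Y) \tag{2.2.15}$$

and coincides with

$$\iint f_{X,Y}(x,y)\,(xy - \mathbb{E}X\,\mathbb{E}Y)\,dy\,dx = \mathbb{E}(XY) - \mathbb{E}X\,\mathbb{E}Y. \tag{2.2.16}$$

By the CS inequality, $|\mathrm{Cov}(X,Y)| \le (\mathrm{Var}\,X)^{1/2}(\mathrm{Var}\,Y)^{1/2}$, with equality if and only if the centred RVs $X - \mathbb{E}X$ and $Y - \mathbb{E}Y$ are proportional, i.e. $X - \mathbb{E}X = c(Y - \mathbb{E}Y)$ where $c$ is a (real) scalar.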
As in the discrete case, 


Var(X + Y) = VarX + VarY + 2 Cov(X,Y) (2.2.17) 


and 
$$\mathrm{Var}(cX) = c^2\,\mathrm{Var}\,X. \tag{2.2.18}$$


If RVs X and Y are independent and with finite mean values, then 


E(XY) = EXEY. (2.2.19) 


The proof here resembles that given for RVs with discrete values: by formula (2.2.12),

$$\mathbb{E}(XY) = \iint yx f_X(x) f_Y(y)\,dx\,dy = \int x f_X(x)\,dx \int y f_Y(y)\,dy = \mathbb{E}X\,\mathbb{E}Y.$$


An immediate consequence is that for independent RVs 


Cov(X,Y) =0, (2.2.20) 
and hence 
Var(X + Y) = VarX + VarY. (2.2.21) 


However, as before, neither equation (2.2.19) nor equation (2.2.20) implies 
independence. An instructive example is as follows. 


Example 2.2.7 Consider a random point inside a unit circle: let X and 
Y be its co-ordinates. Assume that the point is distributed so that the 
probability that it falls within a subset A is proportional to the area of A. 
In the polar representation: 


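$$X = R\cos\Theta, \qquad Y = R\sin\Theta,$$

where $R \in (0,1)$ is the distance between the point and the centre of the circle and $\Theta \in [0, 2\pi)$ is the angle formed by the radius through the random point and the horizontal line. The area in the polar co-ordinates $\rho$, $\theta$ is

$$\frac{1}{\pi}\,\rho\,I(0 \le \rho \le 1)\,I(0 \le \theta \le 2\pi)\,d\rho\,d\theta = f_R(\rho)\,d\rho\,f_\Theta(\theta)\,d\theta,$$

whence $R$ and $\Theta$ are independent, and

$$f_R(\rho) = 2\rho\,I(0 < \rho < 1), \qquad f_\Theta(\theta) = \frac{1}{2\pi}\,I(0 < \theta < 2\pi).$$

Then

$$\mathbb{E}XY = \mathbb{E}(R^2\cos\Theta\sin\Theta) = \mathbb{E}(R^2)\,\mathbb{E}\left(\tfrac{1}{2}\sin(2\Theta)\right) = 0,$$

as

$$\mathbb{E}\sin(2\Theta) = \frac{1}{2\pi}\int_0^{2\pi}\sin(2\theta)\,d\theta = 0.$$

Similarly, $\mathbb{E}X = \mathbb{E}R\,\mathbb{E}\cos\Theta = 0$, and $\mathbb{E}Y = \mathbb{E}R\,\mathbb{E}\sin\Theta = 0$. So, $\mathrm{Cov}(X,Y) = 0$. But $X$ and $Y$ are not independent:

$$\mathbb{P}\left(X > \frac{\sqrt{2}}{2},\ Y > \frac{\sqrt{2}}{2}\right) = 0, \quad\text{but}\quad \mathbb{P}\left(X > \frac{\sqrt{2}}{2}\right) = \mathbb{P}\left(Y > \frac{\sqrt{2}}{2}\right) > 0.$$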
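Example 2.2.7 lends itself to a simulation check. The sketch below samples points in the unit disc through the polar representation ($R = \sqrt{U}$ has PDF $2\rho$ on $(0,1)$), estimates the covariance, and counts points in the "corner" $\{x > \sqrt{2}/2,\ y > \sqrt{2}/2\}$. Function names, the sample size and the seed are illustrative choices, not part of the text.

```python
import math
import random

LEVEL = math.sqrt(2.0) / 2.0


def sample_disc(n, seed=0):
    """Draw n points uniformly from the unit disc via R = sqrt(U), Theta uniform."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        rho = math.sqrt(rng.random())          # PDF 2*rho on (0, 1)
        theta = 2.0 * math.pi * rng.random()   # uniform on [0, 2*pi)
        points.append((rho * math.cos(theta), rho * math.sin(theta)))
    return points


def sample_covariance(points):
    """Unbiased sample covariance of the two co-ordinates."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)


def corner_fraction(points):
    """Fraction of points with both co-ordinates above sqrt(2)/2."""
    return sum(1 for x, y in points if x > LEVEL and y > LEVEL) / len(points)


def marginal_fraction(points):
    """Fraction of points with first co-ordinate above sqrt(2)/2."""
    return sum(1 for x, _ in points if x > LEVEL) / len(points)


pts = sample_disc(100_000, seed=1)
print(sample_covariance(pts), corner_fraction(pts), marginal_fraction(pts))
```

The covariance estimate should be near zero while the corner is empty and the marginal event has visibly positive frequency, which is the dependence the example describes.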


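Generalising formula (2.2.21), for any finite or countable collection of independent RVs and real numbers

$$\mathrm{Var}\left(\sum_i c_i X_i\right) = \sum_i c_i^2\,\mathrm{Var}\,X_i, \tag{2.2.22}$$

provided that each variance $\mathrm{Var}\,X_i$ exists and $\sum_i c_i^2\,\mathrm{Var}\,X_i < \infty$. In particular, for IID RVs $X_1, X_2, \ldots$ with $\mathrm{Var}\,X_i = \sigma^2$, $\mathrm{Var}\left(\sum_{i=1}^n X_i\right) = n\sigma^2$.

The correlation coefficient of two RVs $X$, $Y$ is defined by

$$\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}\,X}\,\sqrt{\mathrm{Var}\,Y}}. \tag{2.2.23}$$

As $|\mathrm{Cov}(X,Y)| \le (\mathrm{Var}\,X)^{1/2}(\mathrm{Var}\,Y)^{1/2}$, $-1 \le \mathrm{Corr}(X,Y) \le 1$. Furthermore, $\mathrm{Corr}(X,Y) = 0$ if $X$ and $Y$ are independent (but not only if), $\mathrm{Corr}(X,Y) = 1$ if and only if $X - \mathbb{E}X = c(Y - \mathbb{E}Y)$ with $c > 0$, and $-1$ if and only if $X - \mathbb{E}X = c(Y - \mathbb{E}Y)$ with $c < 0$.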


Example 2.2.8 For a bivariate normal pair X, Y, with the joint PDF 
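$f_{X,Y}$ of the form (2.1.35), parameter $r \in [-1,1]$ can be identified with the correlation coefficient $\mathrm{Corr}(X,Y)$. More precisely,

$$\mathrm{Var}\,X = \sigma_1^2, \quad \mathrm{Var}\,Y = \sigma_2^2, \quad\text{and}\quad \mathrm{Cov}(X,Y) = r\sigma_1\sigma_2. \tag{2.2.24}$$

In fact, the claim about the variances is straightforward as $X \sim \mathrm{N}(\mu_1,\sigma_1^2)$ and $Y \sim \mathrm{N}(\mu_2, \sigma_2^2)$. Next, the covariance is

$$\begin{aligned}
\mathrm{Cov}(X,Y) &= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\iint (x-\mu_1)(y-\mu_2) \\
&\qquad\times \exp\left\{-\frac{1}{2(1-r^2)}\left[\frac{(x-\mu_1)^2}{\sigma_1^2} - 2r\frac{(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2}\right]\right\}dy\,dx \\
&= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\iint xy \exp\left\{-\frac{1}{2(1-r^2)}\left[\frac{x^2}{\sigma_1^2} - 2r\frac{xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\right]\right\}dy\,dx \\
&= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\iint xy \exp\left\{-\frac{1}{2(1-r^2)}\left[\frac{x^2(1-r^2)}{\sigma_1^2} + \frac{y_1^2}{\sigma_2^2}\right]\right\}dy\,dx,
\end{aligned}$$

where now

$$y_1 = y - rx\frac{\sigma_2}{\sigma_1}.$$

Therefore,

$$\begin{aligned}
\mathrm{Cov}(X,Y) &= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\iint x\left(y_1 + rx\frac{\sigma_2}{\sigma_1}\right)\exp\left[-\frac{x^2}{2\sigma_1^2} - \frac{y_1^2}{2\sigma_2^2(1-r^2)}\right]dy_1\,dx \\
&= \frac{r\sigma_2}{\sqrt{2\pi}\,\sigma_1^2}\int x^2 e^{-x^2/2\sigma_1^2}\,dx = \frac{r\sigma_1\sigma_2}{\sqrt{2\pi}}\int x^2 e^{-x^2/2}\,dx = r\sigma_1\sigma_2.
\end{aligned}$$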


In other words, the off-diagonal entries of the matrices $\Sigma$ and $\Sigma^{-1}$ have been identified with the covariance. Hence,

$$\mathrm{Corr}(X,Y) = r.$$

We see that if $X$ and $Y$ are independent, then $r = 0$ and both $\Sigma$ and $\Sigma^{-1}$ are diagonal matrices. The joint PDF becomes the product of the marginal PDFs:

$$f_{X,Y}(x,y) = \frac{1}{2\pi\sigma_1\sigma_2}\exp\left[-\frac{(x-\mu_1)^2}{2\sigma_1^2} - \frac{(y-\mu_2)^2}{2\sigma_2^2}\right] = \frac{e^{-(x-\mu_1)^2/2\sigma_1^2}}{\sqrt{2\pi}\,\sigma_1}\cdot\frac{e^{-(y-\mu_2)^2/2\sigma_2^2}}{\sqrt{2\pi}\,\sigma_2}.$$

An important observation is that the converse is also true: if $r = 0$, then $\Sigma$ and $\Sigma^{-1}$ are diagonal and $f_{X,Y}(x,y)$ factorises into a product. Hence, for jointly normal RVs, $\mathrm{Cov}(X,Y) = 0$ if and only if $X$, $Y$ are independent.
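The identification $\mathrm{Corr}(X,Y) = r$ can be checked empirically. The sketch below builds a bivariate normal pair from two independent standard normals $Z_1$, $Z_2$ by setting $X = \mu_1 + \sigma_1 Z_1$ and $Y = \mu_2 + \sigma_2(rZ_1 + \sqrt{1-r^2}\,Z_2)$; this construction is a standard one assumed here, not the method used in the text, and the function name and defaults are illustrative.

```python
import math
import random


def sample_correlation(n, r, mu1=0.0, mu2=0.0, sigma1=1.0, sigma2=1.0, seed=0):
    """Sample correlation of n simulated bivariate normal pairs with parameter r."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(mu1 + sigma1 * z1)
        # Mixing z1 into Y with weight r produces Cov(X, Y) = r * sigma1 * sigma2.
        ys.append(mu2 + sigma2 * (r * z1 + math.sqrt(1.0 - r * r) * z2))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)


# The estimate should be close to the chosen r, whatever the means and variances.
print(sample_correlation(100_000, 0.6, mu1=1.0, mu2=-2.0, sigma1=2.0, sigma2=0.5, seed=3))
```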
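The notion of the PGF (and MGF) also emerges in the continuous case:

$$\phi_X(s) = \mathbb{E}s^X = \int s^x f_X(x)\,dx, \quad s > 0,$$

$$M_X(\theta) = \mathbb{E}e^{\theta X} = \int e^{\theta x} f_X(x)\,dx = \phi_X(e^\theta), \quad \theta \in \mathbb{R}. \tag{2.2.25}$$

The PGFs and MGFs are used for non-negative RVs. But even then $\phi_X(s)$ and $M_X(\theta)$ may not exist for some positive $s$ or $\theta$. For instance, for $X \sim \mathrm{Exp}(\lambda)$:

$$\phi_X(s) = \lambda\int_0^\infty s^x e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - \ln s}, \qquad M_X(\theta) = \frac{\lambda}{\lambda - \theta}.$$

Here, $\phi_X$ exists only when $\ln s < \lambda$, i.e. $s < e^\lambda$, and $M_X(\theta)$ when $\theta < \lambda$. For $X \sim \mathrm{Ca}(\alpha,\tau)$, the MGF does not exist: $\mathbb{E}e^{\theta X} = \infty$ for all $\theta \neq 0$.

In some applications (especially when the RV takes non-negative values), one uses the argument $e^{-\theta}$ instead of $e^{\theta}$; the function

$$L_X(\theta) = \mathbb{E}e^{-\theta X} = M_X(-\theta) = \int e^{-\theta x} f_X(x)\,dx, \quad \theta \in \mathbb{R}, \tag{2.2.26}$$

is, as in the discrete case, called the Laplace transform (LTF) of PDF $f_X$. See equation (1.5.19).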
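On the other hand, the characteristic function (CHF) $\psi_X(t)$ defined by

$$\psi_X(t) = \mathbb{E}e^{itX} = \int e^{itx} f_X(x)\,dx, \quad t \in \mathbb{R}, \tag{2.2.27}$$

exists for all $t \in \mathbb{R}$ and PDF $f_X$. See (1.5.20). Moreover: $\psi_X(0) = \mathbb{E}1 = 1$ and

$$|\psi_X(t)| \le \int |e^{itx}| f_X(x)\,dx = \int f_X(x)\,dx = 1.$$

Furthermore, as a function of $t$, $\psi_X$ is (uniformly) continuous on the whole line. In fact:

$$|\psi_X(t+\delta) - \psi_X(t)| \le \int \left|e^{i(t+\delta)x} - e^{itx}\right| f_X(x)\,dx = \int |e^{itx}|\,\left|e^{i\delta x} - 1\right| f_X(x)\,dx = \int \left|e^{i\delta x} - 1\right| f_X(x)\,dx.$$

The RHS does not depend on the choice of $t \in \mathbb{R}$ (uniformity) and goes to 0 as $\delta \to 0$.

The last fact holds because $|e^{i\delta x} - 1| \to 0$, and the whole integrand $|e^{i\delta x} - 1| f_X(x)$ is $\le 2f_X(x)$, an integrable function. A formal argument is as follows: given $\epsilon > 0$, take $A > 0$ so large that

$$2\int_{-\infty}^{-A} f_X(x)\,dx + 2\int_A^\infty f_X(x)\,dx < \frac{\epsilon}{2}.$$

Next, take $\delta$ so small that

$$\left|e^{i\delta x} - 1\right| < \frac{\epsilon}{2}, \quad\text{for all } -A < x < A;$$

we can do this because $e^{i\delta x} \to 1$ as $\delta \to 0$ uniformly on $(-A, A)$. Then split the entire integral and estimate each summand separately, using $\int_{-A}^A f_X(x)\,dx \le 1$:

$$\begin{aligned}
\int \left|e^{i\delta x} - 1\right| f_X(x)\,dx &= \left(\int_{-\infty}^{-A} + \int_A^\infty + \int_{-A}^A\right)\left|e^{i\delta x} - 1\right| f_X(x)\,dx \\
&\le 2\left(\int_{-\infty}^{-A} + \int_A^\infty\right) f_X(x)\,dx + \int_{-A}^A \left|e^{i\delta x} - 1\right| f_X(x)\,dx \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.
\end{aligned}$$

We want to emphasise a few facts that follow from the definition:
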


(1) If $Y = cX + b$, then $\psi_Y(t) = e^{itb}\psi_X(ct)$.

(2) As in the discrete case, the mean $\mathbb{E}X$, the variance $\mathrm{Var}\,X$ and higher moments $\mathbb{E}X^k$ are expressed in terms of derivatives of $\psi_X$ at $t = 0$:

$$\mathbb{E}X = \frac{1}{i}\,\frac{d}{dt}\psi_X(t)\bigg|_{t=0}, \tag{2.2.28}$$

$$\mathrm{Var}\,X = \left[-\frac{d^2}{dt^2}\psi_X(t) + \left(\frac{d}{dt}\psi_X(t)\right)^2\right]\bigg|_{t=0}. \tag{2.2.29}$$

(3) CHF $\psi_X(t)$ uniquely defines PDF $f_X(x)$: if $\psi_X(t) \equiv \psi_Y(t)$, then $f_X(x) \equiv f_Y(x)$. (Formally, the last equality should be understood to mean that $f_X$ and $f_Y$ can differ on a set of measure 0.)

(4) For independent RVs $X$ and $Y$: $\psi_{X+Y}(t) = \psi_X(t)\psi_Y(t)$. The proof is similar to that in the discrete case.


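Example 2.2.9 In this example we calculate the CHF for: (i) uniform, (ii) normal, (iii) exponential, (iv) Gamma, and (v) Cauchy.

(i) If $X \sim \mathrm{U}(a,b)$, then

$$\psi_X(t) = \frac{1}{b-a}\int_a^b e^{itx}\,dx = \frac{e^{itb} - e^{ita}}{it(b-a)}. \tag{2.2.30}$$

(ii) If $X \sim \mathrm{N}(\mu, \sigma^2)$, then

$$\psi_X(t) = \exp\left(it\mu - \tfrac{1}{2}t^2\sigma^2\right). \tag{2.2.31}$$

To prove this fact, set $\mu = 0$, $\sigma^2 = 1$ and take the derivative in $t$ and integrate by parts:

$$\begin{aligned}
\frac{d}{dt}\psi_X(t) &= \frac{d}{dt}\left(\frac{1}{\sqrt{2\pi}}\int e^{itx}e^{-x^2/2}\,dx\right) = \frac{1}{\sqrt{2\pi}}\int (ix)e^{itx}e^{-x^2/2}\,dx \\
&= \frac{1}{\sqrt{2\pi}}\left(-ie^{itx}e^{-x^2/2}\right)\bigg|_{-\infty}^{\infty} - \frac{1}{\sqrt{2\pi}}\int te^{itx}e^{-x^2/2}\,dx = -t\psi_X(t).
\end{aligned}$$

That is

$$(\ln\psi_X(t))' = -t, \quad\text{whence}\quad \ln\psi_X(t) = -\frac{t^2}{2} + c, \quad\text{i.e.}\quad \psi_X(t) = e^c e^{-t^2/2}.$$

As $\psi_X(0) = 1$, $c = 0$. The case of general $\mu$ and $\sigma^2$ is recovered from the above property (1) as $(X - \mu)/\sigma \sim \mathrm{N}(0, 1)$.

The MGF can also be calculated in the same fashion:

$$\mathbb{E}e^{\theta X} = \exp\left(\theta\mu + \tfrac{1}{2}\theta^2\sigma^2\right).$$

Now, by the uniqueness of the RV with a given CHF or MGF, we can confirm the previously established fact that if $X \sim \mathrm{N}(\mu_1,\sigma_1^2)$ and $Y \sim \mathrm{N}(\mu_2, \sigma_2^2)$, independently, then $X + Y \sim \mathrm{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.

(iii) If $X \sim \mathrm{Exp}(\lambda)$, then

$$\psi_X(t) = \lambda\int_0^\infty e^{itx}e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - it}. \tag{2.2.32}$$

(iv) If $X \sim \mathrm{Gam}(\alpha, \lambda)$, then

$$\psi_X(t) = \left(1 - \frac{it}{\lambda}\right)^{-\alpha}. \tag{2.2.33}$$

To prove this fact, we again differentiate with respect to $t$ and integrate by parts:

$$\begin{aligned}
\frac{d}{dt}\psi_X(t) &= \frac{d}{dt}\,\frac{\lambda^\alpha}{\Gamma(\alpha)}\int_0^\infty e^{itx}x^{\alpha-1}e^{-\lambda x}\,dx = \frac{\lambda^\alpha}{\Gamma(\alpha)}\int_0^\infty ix^{\alpha}e^{(it-\lambda)x}\,dx \\
&= \frac{i\alpha}{\lambda - it}\,\frac{\lambda^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{\alpha-1}e^{(it-\lambda)x}\,dx = \frac{i\alpha}{\lambda - it}\,\psi_X(t).
\end{aligned}$$

That is

$$(\ln\psi_X(t))' = (-\alpha\ln(\lambda - it))', \quad\text{whence}\quad \psi_X(t) = c(\lambda - it)^{-\alpha}.$$

As $\psi_X(0) = 1$, $c = \lambda^\alpha$. This yields the result.

Again by using the uniqueness of the RV with a given CHF, we can see that if $X \sim \mathrm{Gam}(\alpha, \lambda)$ and $Y \sim \mathrm{Gam}(\alpha', \lambda)$, independently, then $X + Y \sim \mathrm{Gam}(\alpha + \alpha', \lambda)$. Also, if $X_1, \ldots, X_n$ are IID RVs, with $X_i \sim \mathrm{Exp}(\lambda)$, then $X_1 + \cdots + X_n \sim \mathrm{Gam}(n, \lambda)$.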

(v) To find the CHF of a Cauchy distribution, we make a digression. In analysis, the function

$$t \in \mathbb{R} \mapsto \int e^{itx}f(x)\,dx, \quad t \in \mathbb{R},$$

is called the Fourier transform of the function $f$ and denoted by $\widehat{f}$. The inverse Fourier transform is the function

$$x \in \mathbb{R} \mapsto \frac{1}{2\pi}\int e^{-itx}\widehat{f}(t)\,dt, \quad x \in \mathbb{R}.$$

If $f_X$ is a PDF, with $f_X \ge 0$ and $\int f_X(x)\,dx = 1$, and its Fourier transform $\widehat{f}_X = \psi_X$ has $\int |\psi_X(t)|\,dt < \infty$, then the inverse Fourier transform of $\psi_X$ coincides with $f_X$:

$$f_X(x) = \frac{1}{2\pi}\int e^{-itx}\psi_X(t)\,dt.$$

Furthermore, if we know that a PDF $f_X$ is the inverse Fourier transform of some function $\psi(t)$:

$$f_X(x) = \frac{1}{2\pi}\int e^{-itx}\psi(t)\,dt,$$

then $\psi(t) \equiv \psi_X(t)$. (The last fact is merely a rephrasing of the uniqueness of the PDF with a given CHF.)

The Fourier transform is named after J.B.J. Fourier (1768-1830), a prolific 
scientist who also played an active and prominent role in French political 
life and administration. He was the Prefect of several French Departements, 
notably Grenoble where he was extremely popular and is remembered to 
the present day. In 1798, Fourier went with Napoleon’s expedition to Egypt, 
where he took an active part in the scientific identification of numerous 
ancient treasures. He was unfortunate enough to be captured by the British 
forces and spent some time as a prisoner of war. Until the end of his life he 
was a staunch Bonapartist. However, when Napoleon crossed the Grenoble area on his way to Paris from the Isle of Elba in 1815, Fourier launched an active campaign against the Emperor as he was convinced that France needed peace, not another military adventure.

In mathematics, Fourier created what is now called the Fourier analysis, by 
studying problems of heat transfer. Fourier analysis is extremely important 
and is used in literally every theoretical and applied domain. Fourier is also 
recognised as the scientific creator of modern social statistics. 

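Now consider the PDF

$$f(x) = \frac{1}{2}e^{-|x|}, \quad x \in \mathbb{R}.$$

Its CHF equals the half-sum of the CHFs of the 'positive' and 'negative' exponential RVs with parameter $\lambda = 1$:

$$\frac{1}{2}\int e^{itx}e^{-|x|}\,dx = \frac{1}{2}\left[\frac{1}{1 - it} + \frac{1}{1 + it}\right] = \frac{1}{1 + t^2},$$

which is, up to a scalar factor, the PDF of the Cauchy distribution $\mathrm{Ca}(0, 1)$. Hence, the inverse Fourier transform of the function $e^{-|t|}$ is

$$\frac{1}{2\pi}\int e^{-itx}e^{-|t|}\,dt = \frac{1}{\pi}\,\frac{1}{1 + x^2}.$$

Then, by the above observation, for $X \sim \mathrm{Ca}(0,1)$, $\psi_X(t) = e^{-|t|}$. For $X \sim \mathrm{Ca}(\alpha,\tau)$:

$$\psi_X(t) = e^{i\alpha t - \tau|t|}, \tag{2.2.34}$$

as $(X - \alpha)/\tau \sim \mathrm{Ca}(0, 1)$.

Note that $\psi_X(t)$ for $X \sim \mathrm{Ca}(\alpha,\tau)$ has no derivative at $t = 0$. This reflects the fact that $X$ has no finite expectation. On the other hand, if $X \sim \mathrm{Ca}(\alpha_1,\tau_1)$ and $Y \sim \mathrm{Ca}(\alpha_2,\tau_2)$, independently, then $X + Y \sim \mathrm{Ca}(\alpha_1 + \alpha_2, \tau_1 + \tau_2)$.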
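The closed forms (2.2.30), (2.2.31) and (2.2.32) can be compared with a direct numerical evaluation of the defining integral (2.2.27). The sketch below uses a midpoint rule on a truncated interval; the truncation bounds, step count and function names are assumptions made for illustration only.

```python
import cmath
import math


def chf_numeric(pdf, t, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of exp(i*t*x) * pdf(x) over (lo, hi)."""
    h = (hi - lo) / steps
    total = 0j
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += cmath.exp(1j * t * x) * pdf(x)
    return h * total


def chf_uniform(t, a, b):
    # Formula (2.2.30), valid for t != 0.
    return (cmath.exp(1j * t * b) - cmath.exp(1j * t * a)) / (1j * t * (b - a))


def chf_normal(t, mu, sigma):
    # Formula (2.2.31).
    return cmath.exp(1j * t * mu - 0.5 * t * t * sigma * sigma)


def chf_exponential(t, lam):
    # Formula (2.2.32).
    return lam / (lam - 1j * t)


def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


t = 1.3
print(abs(chf_numeric(lambda x: 1.0 / 3.0, t, -1.0, 2.0) - chf_uniform(t, -1.0, 2.0)))
print(abs(chf_numeric(lambda x: normal_pdf(x, 0.5, 2.0), t, -23.5, 24.5) - chf_normal(t, 0.5, 2.0)))
print(abs(chf_numeric(lambda x: 2.0 * math.exp(-2.0 * x), t, 0.0, 40.0) - chf_exponential(t, 2.0)))
```

All three printed discrepancies should be tiny. The Cauchy case is left out on purpose: its heavy tails make a naive truncated quadrature converge slowly.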


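In fact, the Cauchy distribution emerges as the ratio of jointly normal RVs. Namely, let the joint PDF $f_{X,Y}$ be of the form (2.1.35), with zero means $\mu_1 = \mu_2 = 0$. Then, by equation (2.1.39), the PDF $f_{X/Y}(u)$ equals

$$\begin{aligned}
&\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\int |y| \exp\left[\frac{-y^2}{2\sigma_1^2\sigma_2^2(1-r^2)}\left(\sigma_2^2u^2 - 2r\sigma_1\sigma_2u + \sigma_1^2\right)\right]dy \\
&\quad= \frac{1}{\pi\sigma_1\sigma_2\sqrt{1-r^2}}\int_0^\infty y \exp\left[\frac{-y^2}{2(1-r^2)}\,\frac{\left(\sigma_2^2u^2 - 2r\sigma_1\sigma_2u + \sigma_1^2\right)}{\sigma_1^2\sigma_2^2}\right]dy \\
&\quad= \frac{\sigma_1\sigma_2\sqrt{1-r^2}}{\pi\left(\sigma_2^2u^2 - 2r\sigma_1\sigma_2u + \sigma_1^2\right)}\int_0^\infty e^{-y_1}\,dy_1,
\end{aligned}$$

where

$$y_1 = \frac{y^2}{2(1-r^2)}\,\frac{\sigma_2^2u^2 - 2ru\sigma_1\sigma_2 + \sigma_1^2}{\sigma_1^2\sigma_2^2}.$$

We see that

$$f_{X/Y}(u) = \frac{\sqrt{1-r^2}\,\sigma_1/\sigma_2}{\pi\left((u - r\sigma_1/\sigma_2)^2 + (1-r^2)\sigma_1^2/\sigma_2^2\right)}, \tag{2.2.35}$$

i.e. $X/Y \sim \mathrm{Ca}\left(r\sigma_1/\sigma_2, \sqrt{1-r^2}\,\sigma_1/\sigma_2\right)$. For independent RVs,

$$f_{X/Y}(u) = \frac{\sigma_1/\sigma_2}{\pi\left(u^2 + \sigma_1^2/\sigma_2^2\right)}, \tag{2.2.36}$$

i.e. $X/Y \sim \mathrm{Ca}(0,\sigma_1/\sigma_2)$. See Worked Example 2.1.20.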


Worked Example 2.2.10 Suppose that n items are being tested simul- 
taneously and that the items have independent lifetimes, each exponentially 
distributed with parameter A. Determine the mean and variance of the length 
of time until r items have failed. 


Solution The time until $r$ items have failed is

$$T_r = Y_1 + \cdots + Y_r,$$

where $Y_i \sim \mathrm{Exp}((n - i + 1)\lambda)$ is the time between the $(i-1)$th and the $i$th failures. In fact, $Y_i$ is distributed as a minimum of $(n - i + 1)$ independent $\mathrm{Exp}(\lambda)$ RVs. This implies

$$\mathbb{E}T_r = \sum_{i=1}^r \frac{1}{(n-i+1)\lambda}, \qquad \mathrm{Var}\,T_r = \sum_{i=1}^r \frac{1}{(n-i+1)^2\lambda^2}.$$

A similar problem appeared in physics in the study of photon emission in scintillation counters. Though the total number of particles $n$ is random (say, Exp(4)), this approach works well; see [79], [129].
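The two sums in Worked Example 2.2.10 are easy to evaluate and to compare with a direct simulation, in which $T_r$ is simply the $r$th smallest of $n$ independent $\mathrm{Exp}(\lambda)$ lifetimes. The sketch below does both; the function names, trial count and seed are illustrative assumptions.

```python
import random


def mean_var_time_to_rth_failure(n, r, lam):
    """Exact mean and variance of T_r from the sums over the spacings Y_1, ..., Y_r."""
    rates = [(n - i + 1) * lam for i in range(1, r + 1)]
    mean = sum(1.0 / rate for rate in rates)
    var = sum(1.0 / rate ** 2 for rate in rates)
    return mean, var


def simulate_time_to_rth_failure(n, r, lam, trials=50_000, seed=0):
    """Monte Carlo estimate: T_r is the r-th order statistic of n Exp(lam) lifetimes."""
    rng = random.Random(seed)
    times = []
    for _ in range(trials):
        lifetimes = sorted(rng.expovariate(lam) for _ in range(n))
        times.append(lifetimes[r - 1])
    mean = sum(times) / trials
    var = sum((t - mean) ** 2 for t in times) / (trials - 1)
    return mean, var


print(mean_var_time_to_rth_failure(5, 3, 2.0))
print(simulate_time_to_rth_failure(5, 3, 2.0))
```

The simulation does not use the spacing decomposition at all, so agreement between the two printed pairs is a genuine check of the formulas.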


Worked Example 2.2.11 Let $U_1, U_2, \ldots, U_n$ be IID random variables. The ordered statistic is an arrangement of their values in the increasing order: $U_{(1)} \le \cdots \le U_{(n)}$.

(i) Let $Z_1, Z_2, \ldots, Z_n$ be IID exponential RVs with PDF $f(x) = e^{-x}$, $x \ge 0$. Show that the distribution of $Z_{(1)}$ is exponential and identify its mean.

(ii) Let $X_1, \ldots, X_n$ be IID RVs uniformly distributed on the interval $[0, 1]$, with PDF $f(x) = 1$, $0 \le x \le 1$. Check that the joint PDF of $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ has the form

$$f_{X_{(1)},X_{(2)},\ldots,X_{(n)}}(x_1, \ldots, x_n) = n!\,I(0 \le x_1 \le \cdots \le x_n \le 1).$$

(iii) For random variables $Z_1, Z_2, \ldots, Z_n$ and $X_1, X_2, \ldots, X_n$ as above, prove that the joint distribution of $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ is the same as that of

$$\frac{S_1}{S_{n+1}}, \ldots, \frac{S_n}{S_{n+1}}.$$

Here, for $1 \le i \le n$, $S_i$ is the sum $\sum_{j=1}^i Z_j$, and $S_{n+1} = S_n + Z_{n+1}$, where $Z_{n+1} \sim \mathrm{Exp}(1)$, independently of $Z_1, \ldots, Z_n$.

(iv) Prove that the joint distribution of the above RVs $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ is the same as the joint conditional distribution of $S_1, S_2, \ldots, S_n$ given that $S_{n+1} = 1$.


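Solution (i) $\mathbb{P}(Z_{(1)} > x) = \mathbb{P}(Z_1 > x, \ldots, Z_n > x) = e^{-nx}$. Hence, $Z_{(1)}$ is exponential, with mean $1/n$.

(ii) By the definition, we need to take the values of $X_1, \ldots, X_n$ and order them. A fixed collection of non-decreasing values $x_1, \ldots, x_n$ can be produced from $n!$ unordered samples. Hence, the joint PDF $f_{X_{(1)},X_{(2)},\ldots,X_{(n)}}(x_1, \ldots, x_n)$ equals

$$n!\,f(x_1)\cdots f(x_n)\,I(0 \le x_1 \le \cdots \le x_n \le 1) = n!\,I(0 \le x_1 \le \cdots \le x_n \le 1)$$

in the case of a uniform distribution.

(iii) The joint PDF $f_{S_1/S_{n+1},\ldots,S_n/S_{n+1},S_{n+1}}$ of RVs $S_1/S_{n+1}, \ldots, S_n/S_{n+1}$ and $S_{n+1}$ is calculated as follows:

$$\begin{aligned}
f_{S_1/S_{n+1},\ldots,S_n/S_{n+1},S_{n+1}}(x_1, \ldots, x_n, t)
&= f_{S_{n+1}}(t)\,\frac{\partial^n}{\partial x_1\cdots\partial x_n}\,\mathbb{P}\left(\frac{S_1}{S_{n+1}} < x_1, \ldots, \frac{S_n}{S_{n+1}} < x_n \,\Big|\, S_{n+1} = t\right) \\
&= f_{S_{n+1}}(t)\,\frac{\partial^n}{\partial x_1\cdots\partial x_n}\,\mathbb{P}\left(S_1 < x_1t, \ldots, S_n < x_nt \,\big|\, S_{n+1} = t\right) \\
&= t^n f_{S_1,\ldots,S_n}(tx_1, \ldots, tx_n \mid S_{n+1} = t)\,f_{S_{n+1}}(t) \\
&= t^n f_{S_1,\ldots,S_n,S_{n+1}}(tx_1, \ldots, tx_n, t) \\
&= t^n e^{-tx_1}e^{-t(x_2-x_1)}\cdots e^{-t(1-x_n)}\,I(0 \le x_1 \le \cdots \le x_n \le 1) \\
&= t^n e^{-t}\,I(0 \le x_1 \le \cdots \le x_n \le 1).
\end{aligned}$$

Hence,

$$f_{S_1/S_{n+1},\ldots,S_n/S_{n+1}}(x_1, \ldots, x_n) = \int_0^\infty t^n e^{-t}\,dt\;I(0 \le x_1 \le \cdots \le x_n \le 1).$$

This equals $n!\,I(0 \le x_1 \le \cdots \le x_n \le 1)$, which is the joint PDF of $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$, by (ii).

(iv) The joint PDF $f_{S_1,\ldots,S_{n+1}}(x_1, \ldots, x_{n+1})$ equals

$$\begin{aligned}
&f_{Z_1,\ldots,Z_{n+1}}(x_1, x_2 - x_1, \ldots, x_{n+1} - x_n)\,I(0 \le x_1 \le \cdots \le x_n \le x_{n+1}) \\
&\quad= e^{-x_1}e^{-(x_2-x_1)}\cdots e^{-(x_{n+1}-x_n)}\,I(0 \le x_1 \le \cdots \le x_n \le x_{n+1}) \\
&\quad= e^{-x_{n+1}}\,I(0 \le x_1 \le \cdots \le x_n \le x_{n+1}).
\end{aligned}$$

The sum $S_{n+1}$ has PDF $f_{S_{n+1}}(x) = (1/n!)\,x^n e^{-x}\,I(x \ge 0)$. Hence, the conditional PDF could be expressed as the ratio

$$\frac{f_{S_1,\ldots,S_{n+1}}(x_1, \ldots, x_n, 1)}{f_{S_{n+1}}(1)} = n!\,I(0 \le x_1 \le \cdots \le x_n \le 1).$$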


Worked Example 2.2.12 Let $X_1, X_2, \ldots$ be independent RVs each of which is uniformly distributed on $[0, 1]$. Let

$$U_n = \max_{1 \le j \le n} X_j, \qquad V_n = \min_{1 \le j \le n} X_j.$$

By considering

$$\mathbb{P}(v \le X_1 \le v + \delta v,\ v \le X_2, X_3, \ldots, X_{n-1} \le u,\ u \le X_n \le u + \delta u)$$

with $0 < v < u < 1$, or otherwise, show that $(U_n, V_n)$ has a joint PDF given by

$$f(u,v) = \begin{cases} n(n-1)(u-v)^{n-2}, & \text{if } 0 \le v \le u \le 1, \\ 0, & \text{otherwise.} \end{cases}$$

Find the PDFs $f_{U_n}$ and $f_{V_n}$. Show that $\mathbb{P}(1 - 1/n \le U_n \le 1)$ tends to a non-zero limit as $n \to \infty$ and find it. What can you say about $\mathbb{P}(0 \le V_n \le 1/n)$?

Find the correlation coefficient $\tau_n$ of $U_n$ and $V_n$. Show that $\tau_n \to 0$ as $n \to \infty$. Why should you expect this?


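Solution The joint CDF of $U_n$ and $V_n$ is given by

$$\begin{aligned}
\mathbb{P}(U_n < u, V_n < v) &= \mathbb{P}(U_n < u) - \mathbb{P}(U_n < u, V_n \ge v) \\
&= \mathbb{P}(U_n < u) - \mathbb{P}(v \le X_1 < u, \ldots, v \le X_n < u) \\
&= (F(u))^n - (F(u) - F(v))^n.
\end{aligned}$$

Hence, the joint PDF

$$f_{U_n,V_n}(u,v) = \frac{\partial^2}{\partial u\,\partial v}\left[(F(u))^n - (F(u) - F(v))^n\right] = \begin{cases} n(n-1)(u-v)^{n-2}, & \text{if } 0 \le v \le u \le 1, \\ 0, & \text{otherwise.} \end{cases}$$

The marginal PDFs are

$$f_{U_n}(u) = \int_0^u n(n-1)(u-v)^{n-2}\,dv = nu^{n-1}$$

and

$$f_{V_n}(v) = \int_v^1 n(n-1)(u-v)^{n-2}\,du = n(1-v)^{n-1}.$$

Then the probability $\mathbb{P}(1 - 1/n \le U_n \le 1)$ equals

$$\int_{1-1/n}^1 f_{U_n}(u)\,du = n\int_{1-1/n}^1 u^{n-1}\,du = 1 - \left(1 - \frac{1}{n}\right)^n \to 1 - e^{-1} \quad\text{as } n \to \infty.$$

Similarly, $\mathbb{P}(0 \le V_n \le 1/n) \to 1 - e^{-1}$.

Next,

$$\mathbb{E}U_n = \frac{n}{n+1}, \qquad \mathbb{E}V_n = \frac{1}{n+1},$$

and

$$\mathrm{Var}\,U_n = \mathrm{Var}\,V_n = \frac{n}{(n+1)^2(n+2)}.$$

Furthermore, the covariance $\mathrm{Cov}(U_n, V_n)$ is calculated as the integral

$$\int_0^1\int_0^u n(n-1)(u-v)^{n-2}\left(u - \frac{n}{n+1}\right)\left(v - \frac{1}{n+1}\right)dv\,du = \frac{1}{(n+1)^2(n+2)}.$$

Hence,

$$\tau_n = \mathrm{Corr}(U_n, V_n) = \frac{\mathrm{Cov}(U_n, V_n)}{\sqrt{\mathrm{Var}\,U_n}\,\sqrt{\mathrm{Var}\,V_n}} = \frac{1}{n} \to 0$$

as $n \to \infty$.

Worked Example 2.2.13 Two real-valued RVs, $X$ and $Y$, have joint PDF

$$p(x_1, x_2) = \frac{1}{2\pi\sqrt{1-r^2}}\exp\left[-\frac{1}{2(1-r^2)}\left(x_1^2 - 2rx_1x_2 + x_2^2\right)\right],$$

where $-1 < r < 1$. Prove that each of $X$ and $Y$ is normally distributed with mean 0 and variance 1.

Prove that the number $r$ is the correlation coefficient of $X$ and $Y$.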


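Solution The CDF of $X$ is given by

$$\mathbb{P}(X < t) = \int_{-\infty}^t\int p(x_1, x_2)\,dx_2\,dx_1.$$

The internal integral $\int p(x_1, x_2)\,dx_2$ is equal to

$$\begin{aligned}
&\frac{1}{2\pi\sqrt{1-r^2}}\int \exp\left\{-\frac{1}{2(1-r^2)}\left[(x_2 - rx_1)^2 + (1-r^2)x_1^2\right]\right\}dx_2 \\
&\quad= \frac{e^{-x_1^2/2}}{2\pi}\int \exp\left(-\frac{y^2}{2}\right)dy \quad\left(\text{with } y = \frac{x_2 - rx_1}{\sqrt{1-r^2}}\right) \\
&\quad= \frac{1}{\sqrt{2\pi}}\,e^{-x_1^2/2},
\end{aligned}$$

which specifies the distribution of $X$:

$$\mathbb{P}(X < t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^t e^{-x_1^2/2}\,dx_1,$$

i.e. $X \sim \mathrm{N}(0,1)$. Similarly, $Y \sim \mathrm{N}(0,1)$.

Hence,

$$\mathrm{Corr}(X,Y) = \mathrm{Cov}(X,Y) = \mathbb{E}(XY) = \iint x_1x_2\,p(x_1, x_2)\,dx_2\,dx_1.$$

Now

$$\begin{aligned}
\int x_2\,p(x_1, x_2)\,dx_2 &= \frac{1}{2\pi\sqrt{1-r^2}}\int x_2 \exp\left\{-\frac{1}{2(1-r^2)}\left[(x_2 - rx_1)^2 + (1-r^2)x_1^2\right]\right\}dx_2 \\
&= \frac{rx_1e^{-x_1^2/2}}{\sqrt{2\pi}}.
\end{aligned}$$

Then

$$\mathbb{E}(XY) = \frac{r}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x_1^2\exp\left(-\frac{x_1^2}{2}\right)dx_1 = r.$$

Hence, $\mathrm{Corr}(X,Y) = r$, as required.

Worked Example 2.2.14 Let $X$, $Y$ be independent RVs with values in $(0,\infty)$ and the same PDF $2e^{-x^2}/\sqrt{\pi}$. Let $U = X^2 + Y^2$, $V = Y/X$. Compute the joint PDF $f_{U,V}$ and prove that $U$, $V$ are independent.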


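Solution The joint PDF of $X$ and $Y$ is

$$f_{X,Y}(x,y) = \frac{4}{\pi}\,e^{-(x^2+y^2)}.$$

The change of variables $u = x^2 + y^2$, $v = y/x$ produces the Jacobian

$$\frac{\partial(u,v)}{\partial(x,y)} = \det\begin{pmatrix} 2x & 2y \\ -y/x^2 & 1/x \end{pmatrix} = 2 + 2\frac{y^2}{x^2} = 2(1 + v^2),$$

with the inverse Jacobian

$$\frac{\partial(x,y)}{\partial(u,v)} = \frac{1}{2(1 + v^2)}.$$

Hence, the joint PDF

$$f_{U,V}(u,v) = f_{X,Y}(x(u,v), y(u,v))\left|\frac{\partial(x,y)}{\partial(u,v)}\right| = \frac{2}{\pi}\,\frac{e^{-u}}{1 + v^2}, \quad u > 0,\ v > 0.$$

It means that $U$ and $V$ are independent, $U \sim \mathrm{Exp}(1)$ and $V \sim \mathrm{Ca}(0,1)$ restricted to $(0,\infty)$.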


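Worked Example 2.2.15 (i) Continuous RVs $X$ and $Y$ have a joint PDF

$$f(x,y) = \frac{(m+n+2)!}{m!\,n!}\,(1-x)^m y^n$$

for $0 < y \le x < 1$, where $m$, $n$ are given positive integers. Check that $f$ is a proper PDF (i.e. its integral equals 1). Find the marginal distributions of $X$ and $Y$. Hence calculate

$$\mathbb{P}\left(Y \le \tfrac{1}{3} \,\Big|\, X = \tfrac{2}{3}\right).$$

(ii) Let $X$ and $Y$ be random variables. Check that

$$\mathrm{Cov}(X,Y) = \mathbb{E}[1 - X] \times \mathbb{E}Y - \mathbb{E}[(1 - X)Y].$$

(iii) Let $X$, $Y$ be as in (i). Use the form of $f(x,y)$ to express the expectations $\mathbb{E}(1 - X)$, $\mathbb{E}Y$ and $\mathbb{E}[(1 - X)Y]$ in terms of factorials. Using (ii), or otherwise, show that the covariance $\mathrm{Cov}(X,Y)$ equals

$$\frac{(m+1)(n+1)}{(m+n+3)^2(m+n+4)}.$$

Solution (i) Using the notation $B(m,n) = \Gamma(m)\Gamma(n)/\Gamma(m+n)$,

$$\begin{aligned}
\frac{(m+n+2)!}{m!\,n!}\int_0^1 (1-x)^m\,dx\int_0^x y^n\,dy &= \frac{(m+n+2)!}{m!\,n!}\int_0^1 (1-x)^m\,\frac{x^{n+1}}{n+1}\,dx \\
&= \frac{(m+n+2)!}{m!\,n!}\,\frac{1}{n+1}\,B(m+1, n+2) \\
&= \frac{(m+n+2)!}{m!\,n!}\,\frac{1}{n+1}\,\frac{m!\,(n+1)!}{(m+n+2)!} = 1.
\end{aligned}$$

Hence,

$$\mathbb{P}\left(Y \le \tfrac{1}{3} \,\Big|\, X = \tfrac{2}{3}\right) = \frac{\int_0^{1/3} y^n\,dy}{\int_0^{2/3} y^n\,dy} = \left(\frac{1}{2}\right)^{n+1}.$$

(ii) Straightforward rearrangement shows that

$$\mathbb{E}(1 - X)\,\mathbb{E}Y - \mathbb{E}[(1 - X)Y] = \mathbb{E}(XY) - \mathbb{E}X\,\mathbb{E}Y.$$

(iii) By the definition of the Beta function (see Example 3.1.5),

$$\mathbb{E}(1 - X) = \frac{(m+n+2)!}{m!\,n!}\,\frac{1}{n+1}\,B(m+2, n+2) = \frac{m+1}{m+n+3},$$

$$\mathbb{E}Y = \frac{(m+n+2)!}{m!\,n!}\,\frac{1}{n+2}\,B(m+1, n+3) = \frac{n+1}{m+n+3}$$

and

$$\mathbb{E}[(1 - X)Y] = \frac{(m+n+2)!}{m!\,n!}\,\frac{1}{n+2}\,B(m+2, n+3) = \frac{(m+1)(n+1)}{(m+n+3)(m+n+4)}.$$

Hence,

$$\mathrm{Cov}(X,Y) = \frac{(m+1)(n+1)}{(m+n+3)^2} - \frac{(m+1)(n+1)}{(m+n+3)(m+n+4)} = \frac{(m+1)(n+1)}{(m+n+3)^2(m+n+4)}.$$
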
In Worked Example 2.2.16 a mode (of a PDF f) is the point of maximum; 
correspondingly, one speaks of unimodal, bimodal or multimodal PDFs. 


Worked Example 2.2.16 Assume $X_1, X_2, \ldots$ form a sequence of independent RVs, each uniformly distributed on $(0, 1)$. Let

$$N = \min\{n: X_1 + X_2 + \cdots + X_n \ge 1\}.$$

Show that $\mathbb{E}N = e$.

Solution Clearly, $N$ takes the values $2, 3, \ldots$. Then

$$\mathbb{E}N = \sum_{l \ge 2} l\,\mathbb{P}(N = l) = 1 + \sum_{l \ge 2}\mathbb{P}(N \ge l).$$

Now $\mathbb{P}(N \ge l) = \mathbb{P}(X_1 + X_2 + \cdots + X_{l-1} < 1)$. Finally, setting

$$\mathbb{P}(X_1 + X_2 + \cdots + X_l < y) = q_l(y), \quad 0 < y < 1,$$

write $q_1(y) = y$, and

$$q_l(y) = \int_0^y p_{X_l}(u)\,q_{l-1}(y - u)\,du = \int_0^y q_{l-1}(y - u)\,du,$$

yielding $q_2(y) = y^2/2!$, $q_3(y) = y^3/3!$, $\ldots$. The induction hypothesis $q_{l-1}(y) = y^{l-1}/(l-1)!$ now gives

$$q_l(y) = \int_0^y \frac{u^{l-1}}{(l-1)!}\,du = \frac{y^l}{l!},$$

and we get that

$$\mathbb{E}N = 1 + \sum_{l \ge 2} q_{l-1}(1) = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots = e.$$

Alternatively, let $N(x) = \min\{n: X_1 + X_2 + \cdots + X_n \ge x\}$ and $m(x) = \mathbb{E}N(x)$. Then

$$m(x) = 1 + \int_0^x m(u)\,du, \quad\text{whence}\quad m'(x) = m(x).$$

Integrating this ordinary differential equation with initial condition $m(0) = 1$ one gets $m(1) = e$.
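The identity $\mathbb{E}N = e$ is pleasant to confirm by simulation. The following sketch draws uniform RVs until their running sum reaches 1 and averages the number of draws; the function names, trial count and seed are illustrative assumptions.

```python
import math
import random


def first_passage_index(rng):
    """Return N = min{n : X_1 + ... + X_n >= 1} for IID U(0, 1) draws."""
    total, n = 0.0, 0
    while total < 1.0:
        total += rng.random()
        n += 1
    return n


def estimate_mean_n(trials=200_000, seed=0):
    """Average of N over independent repetitions, plus the smallest value seen."""
    rng = random.Random(seed)
    values = [first_passage_index(rng) for _ in range(trials)]
    return sum(values) / trials, min(values)


mean_n, smallest_n = estimate_mean_n()
print(mean_n, math.e, smallest_n)
```

The smallest observed value is 2, in line with the remark that $N$ takes the values $2, 3, \ldots$, and the sample mean should sit close to $e \approx 2.718$.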


Worked Example 2.2.17 What does it mean to say that the real-valued RVs $X$ and $Y$ are independent, and how, in terms of the joint PDF $f_{X,Y}$, could you recognise whether $X$ and $Y$ are independent? The non-negative RVs $X$ and $Y$ have the joint PDF

$$f_{X,Y}(x,y) = \frac{1}{2}(x + y)e^{-(x+y)}\,I(x,y > 0), \quad x,y \in \mathbb{R}.$$

Find the PDF $f_X$ and hence deduce that $X$ and $Y$ are not independent.

Find the joint PDF of $(tX, tY)$, where $t$ is a positive real number. Suppose now that $T$ is an RV independent of $X$ and $Y$ with PDF

$$p(t) = \begin{cases} 2t, & \text{if } 0 < t < 1, \\ 0, & \text{otherwise.} \end{cases}$$

Prove that the RVs $TX$ and $TY$ are independent.


Solution Clearly,

$$f_X(x) = \int_0^\infty f_{X,Y}(x,y)\,dy = \frac{1}{2}(1 + x)e^{-x}, \quad x > 0.$$

Similarly, $f_Y(y) = (1 + y)e^{-y}/2$, $y > 0$. Hence, $f_{X,Y} \neq f_Xf_Y$, and $X$ and $Y$ are dependent.

Next, for $t, x, y \in \mathbb{R}$,

$$f_{T,X,Y}(t,x,y) = t(x + y)e^{-(x+y)}\,I(0 < t < 1)\,I(x,y > 0).$$

To find the joint PDF of the RVs $TX$ and $TY$ we must pass to the variables $t$, $u$ and $v$, where $u = tx$, $v = ty$. The Jacobian

$$\frac{\partial(t,u,v)}{\partial(t,x,y)} = \det\begin{pmatrix} 1 & x & y \\ 0 & t & 0 \\ 0 & 0 & t \end{pmatrix} = t^2.$$

Hence,

$$f_{TX,TY}(u,v) = \int_0^1 t^{-2}f_{T,X,Y}(t, u/t, v/t)\,dt = \int_0^1 t^{-2}(u + v)\exp\left(-\frac{u + v}{t}\right)dt.$$

Changing the variable $t \to \tau = t^{-1}$ yields

$$f_{TX,TY}(u,v) = \int_1^\infty d\tau\,(u + v)\exp(-\tau(u + v)) = e^{-(u+v)}, \quad u,v > 0.$$

Hence, $TX$ and $TY$ are independent (and exponentially distributed with mean 1).


Worked Example 2.2.18 A pharmaceutical company produces drugs based on a chemical Amethanol. The strength of a unit of a drug is taken to be $-\ln(1 - x)$, where $0 < x < 1$ is the portion of Amethanol in the unit and $1 - x$ that of an added placebo substance. You test a sample of three
units taken from three large containers filled with the Amethanol powder 
and the added substance in unknown proportions. The containers have been 
thoroughly shaken up before sampling. 

Find the joint CDF of the strengths of the units and the CDF of the 
minimal strength. 


Solution It is convenient to set $\omega = (x,y,z)$, where $0 \le x,y,z \le 1$ represent the portions of Amethanol in the units. Then $\Omega$ is the unit cube $\{\omega = (x,y,z): 0 \le x,y,z \le 1\}$. If $X_1$, $X_2$, $X_3$ are the strengths of the units, then

$$X_1(\omega) = -\ln(1 - x), \quad X_2(\omega) = -\ln(1 - y), \quad X_3(\omega) = -\ln(1 - z).$$

We assume that the probability mass is spread on $\Omega$ uniformly, i.e. the portions $x$, $y$, $z$ are uniformly distributed on $(0,1)$. Then for all $j = 1, 2, 3$, the CDF $F_{X_j}(y) = \mathbb{P}(X_j < y)$ is calculated as

$$\int_0^1 I(-\ln(1 - x) < y)\,dx = \int_0^{1-e^{-y}} dx\;I(y > 0) = (1 - e^{-y})\,I(y > 0),$$

i.e. $X_j \sim \mathrm{Exp}(1)$. Further, $\mathbb{P}(\min_j X_j < y) = 1 - \mathbb{P}(\min_j X_j \ge y)$ and

$$\mathbb{P}(\min_j X_j \ge y) = \mathbb{P}(X_1 \ge y, X_2 \ge y, X_3 \ge y) = \prod_j \mathbb{P}(X_j \ge y) = e^{-3y}.$$

That is $\min_j X_j \sim \mathrm{Exp}(3)$.

In this problem, the joint CDF $F_{X_1,X_2,X_3}(y_1, y_2, y_3)$ of the three units of drug equals the volume of the set

$$\{(x,y,z): 0 \le x,y,z \le 1,\ -\ln(1 - x) < y_1,\ -\ln(1 - y) < y_2,\ -\ln(1 - z) < y_3\}$$

and coincides with the product

$$(1 - e^{-y_1})(1 - e^{-y_2})(1 - e^{-y_3}),$$

i.e. with $F_{X_1}(y_1)F_{X_2}(y_2)F_{X_3}(y_3)$. The joint PDF is also the product:

$$f_{X_1,X_2,X_3}(x_1, x_2, x_3) = e^{-x_1}e^{-x_2}e^{-x_3}.$$


Worked Example 2.2.19 Let $X_1, X_2, \ldots$ be a sequence of IID RVs having common MGF $M(\theta)$, and let $N$ be an RV taking non-negative integer values with PGF $\phi(s)$; assume that $N$ is independent of the sequence $(X_i)$. Show that $Z = X_1 + X_2 + \cdots + X_N$ has MGF $\phi(M(\theta))$.

The sizes of claims made against an insurance company form an IID sequence having common PDF $f(x) = e^{-x}$, $x \ge 0$. The number of claims during a given year had the Poisson distribution with parameter $\lambda$. Show that the MGF of the total amount $T$ of claims during the year is

$$\psi(\theta) = \exp\{\lambda\theta/(1 - \theta)\} \quad\text{for } \theta < 1.$$

Deduce that $T$ has mean $\lambda$ and variance $2\lambda$.


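Solution The MGF

$$M_Z(\theta) = \mathbb{E}e^{\theta Z} = \mathbb{E}\left[\mathbb{E}\left(e^{\theta(X_1 + \cdots + X_N)} \,\big|\, N\right)\right] = \sum_{n \ge 0}\mathbb{P}(N = n)\left(M_{X_1}(\theta)\right)^n = \phi\left(M_{X_1}(\theta)\right),$$

as required.

Similarly, for $T$:

$$T = X_1 + \cdots + X_N,$$

where $X_1, \ldots, X_N$ represent the sizes of the claims, with the PDF $f_{X_i}(x) = e^{-x}I(x > 0)$, independently, and $N$ stands for the number of claims, with $\mathbb{P}(N = n) = \lambda^n e^{-\lambda}/n!$.

Now, the PGF of $N$ is $\phi(s) = e^{\lambda(s-1)}$, and the MGF of $X_i$ is $M_{X_i}(\theta) = 1/(1 - \theta)$. Then the MGF of $T$ is

$$\psi(\theta) = \exp\left[\lambda\left(\frac{1}{1 - \theta} - 1\right)\right] = \exp\left[\frac{\lambda\theta}{1 - \theta}\right],$$

as required. Finally,

$$\psi'(0) = \mathbb{E}T = \lambda, \qquad \psi''(0) = \mathbb{E}T^2 = 2\lambda + \lambda^2,$$

and $\mathrm{Var}\,T = 2\lambda$.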
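The conclusion $\mathbb{E}T = \lambda$, $\mathrm{Var}\,T = 2\lambda$ can be checked by simulating the compound sum directly. Since the standard library has no Poisson sampler, the sketch below counts unit-rate exponential arrivals in $[0, \lambda]$ to produce $N$; that device, the function names, the trial count and the seed are all assumptions of this illustration.

```python
import random


def poisson_sample(rng, lam):
    """Poisson(lam) draw: number of unit-rate exponential arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def total_claims(rng, lam):
    """T = X_1 + ... + X_N with N ~ Poisson(lam) and IID Exp(1) claim sizes."""
    return sum(rng.expovariate(1.0) for _ in range(poisson_sample(rng, lam)))


def simulate_total_claims(lam, trials=100_000, seed=0):
    """Sample mean and sample variance of T over independent years."""
    rng = random.Random(seed)
    totals = [total_claims(rng, lam) for _ in range(trials)]
    mean = sum(totals) / trials
    var = sum((t - mean) ** 2 for t in totals) / (trials - 1)
    return mean, var


print(simulate_total_claims(3.0))  # compare with lam = 3 and 2 * lam = 6
```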


Worked Example 2.2.20 Let $X$ be an exponentially distributed RV, with PDF

$$f(x) = \frac{1}{\mu}\,e^{-x/\mu}, \quad x > 0,$$

where $\mu > 0$. Show that $\mathbb{E}X = \mu$ and $\mathrm{Var}\,X = \mu^2$.

In an experiment, $n$ independent observations $X_1, \ldots, X_n$ are generated from an exponential distribution with expectation $\mu$, and an estimate $\overline{X} = \sum_{i=1}^n X_i/n$ of $\mu$ is obtained. A second independent experiment yields $m$ independent observations $Y_1, \ldots, Y_m$ from the same exponential distribution as in the first experiment, and the second estimate $\overline{Y} = \sum_{j=1}^m Y_j/m$ of $\mu$ is obtained. The two estimates are combined into

$$T_p = p\overline{X} + (1 - p)\overline{Y},$$

where $0 < p < 1$.

Find $\mathbb{E}T_p$ and $\mathrm{Var}\,T_p$ and show that, for all $\epsilon > 0$,

$$\mathbb{P}(|T_p - \mu| > \epsilon) \to 0 \quad\text{as } m,n \to \infty.$$

Find the value $\widehat{p}$ of $p$ that minimises $\mathrm{Var}\,T_p$ and interpret $T_{\widehat{p}}$. Show that the ratio of the inverses of the variances of $\overline{X}$ and $\overline{Y}$ is $\widehat{p}/(1 - \widehat{p})$.


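Solution Integrating by parts:

$$\mathbb{E}X = \frac{1}{\mu}\int_0^\infty xe^{-x/\mu}\,dx = \left(-xe^{-x/\mu}\right)\Big|_0^\infty + \int_0^\infty e^{-x/\mu}\,dx = \mu$$

and

$$\mathbb{E}X^2 = \int_0^\infty \frac{1}{\mu}\,x^2e^{-x/\mu}\,dx = \left(-x^2e^{-x/\mu}\right)\Big|_0^\infty + 2\mu\,\mathbb{E}X = 2\mu^2,$$

which yields

$$\mathrm{Var}\,X = 2\mu^2 - \mu^2 = \mu^2.$$

Now,

$$\mathbb{E}\overline{X} = \frac{1}{n}\sum_{i=1}^n \mathbb{E}X_i = \mu, \qquad \mathbb{E}\overline{Y} = \frac{1}{m}\sum_{j=1}^m \mathbb{E}Y_j = \mu,$$

and

$$\mathbb{E}T_p = p\,\mathbb{E}\overline{X} + (1 - p)\,\mathbb{E}\overline{Y} = \mu.$$

Similarly,

$$\mathrm{Var}\,\overline{X} = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}\,X_i = \frac{\mu^2}{n}, \qquad \mathrm{Var}\,\overline{Y} = \frac{1}{m^2}\sum_{j=1}^m \mathrm{Var}\,Y_j = \frac{\mu^2}{m},$$

and

$$\mathrm{Var}\,T_p = p^2\,\mathrm{Var}\,\overline{X} + (1 - p)^2\,\mathrm{Var}\,\overline{Y} = \mu^2\left[\frac{p^2}{n} + \frac{(1 - p)^2}{m}\right].$$

Chebyshev's inequality $\mathbb{P}(|Z| \ge \epsilon) \le (1/\epsilon^2)\,\mathbb{E}Z^2$ gives

$$\mathbb{P}(|T_p - \mu| > \epsilon) \le \frac{1}{\epsilon^2}\,\mathbb{E}(T_p - \mu)^2 = \frac{1}{\epsilon^2}\,\mathrm{Var}\,T_p = \frac{\mu^2}{\epsilon^2}\left[\frac{p^2}{n} + \frac{(1 - p)^2}{m}\right] \to 0 \quad\text{as } n,m \to \infty.$$

To minimise, take

$$\frac{d}{dp}\,\mathrm{Var}\,T_p = 2\mu^2\left(\frac{p}{n} - \frac{1 - p}{m}\right) = 0, \quad\text{with}\quad \widehat{p} = \frac{n}{n + m}.$$

As

$$\frac{d^2}{dp^2}\,\mathrm{Var}\,T_p = 2\mu^2\left(\frac{1}{n} + \frac{1}{m}\right) > 0,$$

$\widehat{p}$ is the (global) minimum. Then

$$T_{\widehat{p}} = \frac{n\overline{X} + m\overline{Y}}{n + m}$$

is the average of the total of $n + m$ observations. Finally,

$$\frac{\widehat{p}}{1 - \widehat{p}} = \frac{n}{m} \quad\text{and}\quad \frac{1/\mathrm{Var}\,\overline{X}}{1/\mathrm{Var}\,\overline{Y}} = \frac{\mu^2/m}{\mu^2/n} = \frac{n}{m}.$$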
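The minimisation step can be double-checked numerically: the sketch below evaluates $\mathrm{Var}\,T_p$ from the closed form above on a grid of $p$ values and confirms that the minimiser is $n/(n+m)$. The function names and the grid resolution are illustrative choices.

```python
def var_tp(p, n, m, mu):
    """Variance of T_p = p * Xbar + (1 - p) * Ybar for exponential samples with mean mu."""
    return mu ** 2 * (p ** 2 / n + (1.0 - p) ** 2 / m)


def best_p(n, m):
    """Closed-form minimiser of Var T_p."""
    return n / (n + m)


def grid_minimiser(n, m, mu, points=1001):
    """Brute-force minimiser of Var T_p over an equally spaced grid on [0, 1]."""
    grid = [k / (points - 1) for k in range(points)]
    return min(grid, key=lambda p: var_tp(p, n, m, mu))


n, m, mu = 4, 6, 2.0
print(best_p(n, m), grid_minimiser(n, m, mu), var_tp(best_p(n, m), n, m, mu))
```

At the optimum the variance equals $\mu^2/(n+m)$, the variance of the average of all $n + m$ observations, which matches the interpretation of $T_{\widehat{p}}$ given in the solution.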


Worked Example 2.2.21 Let $X_1, X_2, \ldots$ be IID RVs with PDF

$$f(x) = \frac{\alpha}{x^{\alpha+1}}, \quad x \ge 1,$$

where $\alpha > 0$.

There is an exceedance of $u$ by $\{X_i\}$ at $j$ if $X_j > u$. Let $L(u) = \min\{i \ge 1: X_i > u\}$, where $u > 1$, be the time of the first exceedance of $u$ by $\{X_i\}$. Find $\mathbb{P}(L(u) = k)$ for $k = 1, 2, \ldots$ in terms of $\alpha$ and $u$. Hence find the expected time $\mathbb{E}L(u)$ to the first exceedance. Show that $\mathbb{E}L(u) \to \infty$ as $u \to \infty$.

Find the limit as $u \to \infty$ of the probability that there is an exceedance of $u$ before time $\mathbb{E}L(u)$.


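Solution Observe that

$$\mathbb{P}(X_i > u) = \int_u^\infty \alpha x^{-\alpha-1}\,dx = \left(-x^{-\alpha}\right)\Big|_u^\infty = \frac{1}{u^\alpha}.$$

Then

$$\mathbb{P}(L(u) = k) = \left(1 - \frac{1}{u^\alpha}\right)^{k-1}\frac{1}{u^\alpha}, \quad k = 1, 2, \ldots,$$

i.e. $L(u) \sim \mathrm{Geom}(1 - 1/u^\alpha)$, and $\mathbb{E}L(u) = u^\alpha$, which tends to $\infty$ as $u \to \infty$.

Now, with $[\,\cdot\,]$ standing for the integer part:

$$\begin{aligned}
\mathbb{P}(L(u) < \mathbb{E}L(u)) &= \mathbb{P}(L(u) \le [u^\alpha]) = 1 - \mathbb{P}(L(u) > [u^\alpha]) \\
&= 1 - \sum_{k \ge [u^\alpha]+1}\left(1 - \frac{1}{u^\alpha}\right)^{k-1}\frac{1}{u^\alpha} \\
&= 1 - \frac{1}{u^\alpha}\left(1 - \frac{1}{u^\alpha}\right)^{[u^\alpha]}\frac{1}{1 - (1 - 1/u^\alpha)} \\
&= 1 - \left(1 - \frac{1}{u^\alpha}\right)^{[u^\alpha]} \to 1 - e^{-1}, \quad\text{as } u \to \infty.
\end{aligned}$$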
U 


Worked Example 2.2.22 Let A and B be independent RVs each having 
the uniform distribution on (0, 1]. Let U = min{ A, B} and V = max{A, B}. 
Find the mean values of U and hence find the covariance of U and V. 
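Solution The tail probability, for $0 \le x \le 1$:

\[
1 - F_U(x) = \mathbb{P}(U \ge x) = \mathbb{P}(A \ge x,\ B \ge x) = (1-x)^2 .
\]

Hence, still for $0 \le x \le 1$: $F_U(x) = 2x - x^2$, and

\[
f_U(x) = F_U'(x)\,\mathbf{1}(0 < x < 1) = (2-2x)\,\mathbf{1}(0 < x < 1).
\]

Therefore,

\[
\mathbb{E}U = \int_0^1 x(2-2x)\,dx = \left(x^2 - \tfrac{2}{3}x^3\right)\Big|_0^1 = \frac{1}{3}.
\]

As $U + V = A + B$ and $\mathbb{E}(A+B) = \mathbb{E}A + \mathbb{E}B = 1/2 + 1/2 = 1$,
$\mathbb{E}V = 1 - 1/3 = 2/3$.

Next, as $UV = AB$, $\mathrm{Cov}(U,V) = \mathbb{E}(UV) - \mathbb{E}U\,\mathbb{E}V$, which in turn equals

\[
\mathbb{E}(AB) - \frac{2}{9} = \mathbb{E}A\,\mathbb{E}B - \frac{2}{9} = \frac{1}{4} - \frac{2}{9} = \frac{1}{36}.
\]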
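
A quick simulation can be used to sanity-check the values $\mathbb{E}U = 1/3$, $\mathbb{E}V = 2/3$ and $\mathrm{Cov}(U,V) = 1/36$. This is only an illustrative sketch; the function name, seed and number of trials are assumptions made here.

```python
import random


def simulate_min_max(trials: int, seed: int = 0):
    """Monte Carlo estimates of E U, E V and Cov(U, V) for U = min(A, B), V = max(A, B)."""
    rng = random.Random(seed)
    sum_u = sum_v = sum_uv = 0.0
    for _ in range(trials):
        a, b = rng.random(), rng.random()   # independent uniforms on the unit interval
        u, v = min(a, b), max(a, b)
        sum_u += u
        sum_v += v
        sum_uv += u * v
    mean_u, mean_v = sum_u / trials, sum_v / trials
    cov = sum_uv / trials - mean_u * mean_v
    return mean_u, mean_v, cov


print(simulate_min_max(50_000))
print("exact:", 1 / 3, 2 / 3, 1 / 36)
```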


Worked Example 2.2.23 The random variables $X$ and $Y$ each take values in $\{0,1\}$, and their joint distribution $p(x,y) = \mathbb{P}(X = x, Y = y)$ is given by

\[
p(0,0) = a,\quad p(0,1) = b,\quad p(1,0) = c,\quad p(1,1) = d .
\]

Find necessary and sufficient conditions for $X$ and $Y$ to be:
(i) uncorrelated; and

(ii) independent.

Are the conditions established in (i) and (ii) equivalent?


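Solution (a) We have

\[
\mathbb{P}(X = 0) = a+b,\quad \mathbb{P}(Y = 0) = a+c,\quad \mathbb{P}(X = 1) = c+d,\quad \mathbb{P}(Y = 1) = b+d;
\]

\[
\mathrm{Cov}(X,Y) = d - (c+d)(b+d).
\]

Thus $X, Y$ are uncorrelated if and only if $d = (c+d)(b+d)$. This equality is equivalent to $ad = bc$: since $a+b+c+d = 1$, we have $(c+d)(b+d) = bc + d(b+c+d) = bc + d(1-a)$.

(b) If $A, B$ are independent events then $A$ and $\bar B$ are independent, $\bar A$ and $B$ are independent, and $\bar A$ and $\bar B$ are independent. Thus $X, Y$ are independent if and only if $\mathbb{P}(X = 1, Y = 1) = \mathbb{P}(X = 1)\mathbb{P}(Y = 1)$, i.e. $d = (c+d)(b+d)$. Thus, the conditions in (a) and (b) are equivalent (in contrast with the case of general RVs).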


Worked Example 2.2.24 The random variable X has probability density 
function 


\[
f(x) = \begin{cases} c\,x(1-x) & \text{if } 0 < x < 1,\\ 0 & \text{otherwise.} \end{cases}
\]


Determine c, and the mean and variance of X. 


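Solution From the equality

\[
\int_0^1 c\,x(1-x)\,dx = c\left(\frac{x^2}{2} - \frac{x^3}{3}\right)\Big|_0^1 = \frac{c}{6} = 1
\]

we find that $c = 6$. So,

\[
\mathbb{E}X = 6\int_0^1 x^2(1-x)\,dx = 6\left(\frac{1}{3}-\frac{1}{4}\right) = \frac{1}{2},\qquad
\mathbb{E}X^2 = 6\int_0^1 x^3(1-x)\,dx = 6\left(\frac{1}{4}-\frac{1}{5}\right) = \frac{3}{10},
\]

and

\[
\mathrm{Var}\,X = \mathbb{E}X^2 - (\mathbb{E}X)^2 = \frac{3}{10} - \frac{1}{4} = \frac{1}{20}.
\]


Worked Example 2.2.25 The RVs $X_1$ and $X_2$ are independent, and each has an exponential distribution with parameter $\lambda$. Find the joint density function of

\[
Y_1 = X_1 + X_2,\qquad Y_2 = X_1/X_2,
\]

and show that $Y_1$ and $Y_2$ are independent. What is the density of $Y_2$?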


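Solution Consider a map

\[
T: (x_1,x_2) \mapsto (y_1,y_2),\quad\text{where } y_1 = x_1 + x_2,\ y_2 = \frac{x_1}{x_2},
\]

and $x_1, x_2, y_1, y_2 \ge 0$. The inverse map $T^{-1}$ acts as follows:

\[
T^{-1}: (y_1,y_2) \mapsto (x_1,x_2),\quad\text{where } x_1 = \frac{y_1y_2}{1+y_2},\ x_2 = \frac{y_1}{1+y_2}.
\]

The Jacobian of this map is given by

\[
J(y_1,y_2) = \det\begin{pmatrix} \dfrac{y_2}{1+y_2} & \dfrac{y_1}{(1+y_2)^2} \\[2mm] \dfrac{1}{1+y_2} & -\dfrac{y_1}{(1+y_2)^2} \end{pmatrix}
= -\frac{y_1y_2}{(1+y_2)^3} - \frac{y_1}{(1+y_2)^3} = -\frac{y_1}{(1+y_2)^2}.
\]

Now we write the joint PDF

\[
f_{Y_1,Y_2}(y_1,y_2) = f_{X_1,X_2}\left(\frac{y_1y_2}{1+y_2},\frac{y_1}{1+y_2}\right)\left|-\frac{y_1}{(1+y_2)^2}\right| .
\]

After substitution $f_{X_1,X_2}(x_1,x_2) = \lambda e^{-\lambda x_1}\,\lambda e^{-\lambda x_2}$, $x_1,x_2 \ge 0$, we obtain

\[
f_{Y_1,Y_2}(y_1,y_2) = \lambda^2 y_1 e^{-\lambda y_1}\,\frac{1}{(1+y_2)^2},\quad y_1,y_2 \ge 0 .
\]

Let us find the marginal distributions

\[
f_{Y_1}(y_1) = \int_0^\infty f_{Y_1,Y_2}(y_1,y_2)\,dy_2 = \lambda^2 y_1 e^{-\lambda y_1},\quad y_1 \ge 0,
\]

and

\[
f_{Y_2}(y_2) = \int_0^\infty f_{Y_1,Y_2}(y_1,y_2)\,dy_1 = \frac{1}{(1+y_2)^2},\quad y_2 \ge 0 .
\]

By the equality $f_{Y_1,Y_2}(y_1,y_2) = f_{Y_1}(y_1) f_{Y_2}(y_2)$ the RVs $Y_1$ and $Y_2$ are independent, and $f_{Y_2}(y_2) = \dfrac{1}{(1+y_2)^2}$.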


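Worked Example 2.2.26 Let $X_1, X_2,\ldots$ be a sequence of IID continuous RVs. Let the random variable $N$ mark the point at which the sequence stops decreasing: that is, $N \ge 2$ is such that

\[
X_1 \ge X_2 \ge \cdots \ge X_{N-1} < X_N,
\]

where, if there is no such finite value of $N$, we set $N = \infty$. Compute $\mathbb{P}(N = r)$, and show that $\mathbb{P}(N = \infty) = 0$. Determine $\mathbb{E}N$.


Solution As the RVs $X_1, X_2,\ldots$ are IID and continuous,

\[
\mathbb{P}(X_1 \ge X_2 \ge \cdots \ge X_n) = \mathbb{P}(X_1 > X_2 > \cdots > X_n) = \frac{1}{n!},
\]

since all permutations are equally likely.
But we have the equality of events:

\[
\{N > n\} = \{X_1 > X_2 > \cdots > X_n\},
\]

implying that

\[
\mathbb{P}(N = n) = \mathbb{P}(N > n-1) - \mathbb{P}(N > n) = \frac{1}{(n-1)!} - \frac{1}{n!},\quad n \ge 2 .
\]

Then

\[
\sum_{2 \le n < \infty}\mathbb{P}(N = n) = \sum_{2 \le n < \infty}\left[\frac{1}{(n-1)!} - \frac{1}{n!}\right] = 1,
\]

so that $\mathbb{P}(N = \infty) = 0$. Further,

\[
\mathbb{E}N = \sum_{1 \le n < \infty}\mathbb{P}(N \ge n)
= 1 + \sum_{2 \le n < \infty}\frac{1}{(n-1)!}
= \sum_{0 \le n < \infty}\frac{1}{n!} = e .
\]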
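
The answer $\mathbb{E}N = e$ in Worked Example 2.2.26 is easy to test by simulation. The sketch below uses uniform RVs as one convenient continuous distribution (the result does not depend on this choice); the function names and seed are illustrative assumptions.

```python
import math
import random


def stopping_index(rng: random.Random) -> int:
    """Return N >= 2, the first index n with X_{n-1} < X_n."""
    prev = rng.random()
    n = 1
    while True:
        n += 1
        cur = rng.random()
        if cur > prev:          # the sequence has stopped decreasing
            return n
        prev = cur


def mean_stopping_index(trials: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(stopping_index(rng) for _ in range(trials)) / trials


print(mean_stopping_index(50_000), "vs e =", math.e)
```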


Worked Example 2.2.27 Let X and Y be independent non-negative 
RVs, with densities f and g respectively. Find the joint density of U = X 
and V = X + aY, where a is a positive constant. 

Let $X$ and $Y$ be independent and exponentially distributed RVs, each with density

\[
f(x) = \lambda e^{-\lambda x},\quad x \ge 0 .
\]

Find the density of $X + \frac{1}{2}Y$. Is it the same as the density of the random variable $\max(X,Y)$?


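Solution Set: $U = X$, $V = X + aY$, with

\[
X = U,\qquad Y = \frac{1}{a}(V - U),
\]

and the Jacobian

\[
\frac{\partial(x,y)}{\partial(u,v)} = \det\begin{pmatrix} 1 & 0 \\ -1/a & 1/a \end{pmatrix} = \frac{1}{a}.
\]

This yields the joint PDF

\[
h_{U,V}(u,v) = f_X(u)\,g_Y\big((v-u)/a\big)\,\frac{1}{a}\,\mathbf{1}(v \ge u),\quad u > 0,\ v > 0 .
\]

In the example (where $a = 1/2$),

\[
h_{U,V}(u,v) = 2\lambda e^{-\lambda u}\,\lambda e^{-2\lambda(v-u)},\quad 0 < u < v .
\]

Then the PDF of the random variable $V$ equals

\[
\begin{aligned}
h_V(v) &= \int_0^v 2\lambda e^{-\lambda u}\,\lambda e^{-2\lambda(v-u)}\,du
= 2\lambda^2 e^{-2\lambda v}\int_0^v e^{\lambda u}\,du \\
&= 2\lambda^2 e^{-2\lambda v}\,\frac{1}{\lambda}\,e^{\lambda u}\Big|_0^v
= 2\lambda e^{-2\lambda v}\left(e^{\lambda v} - 1\right),\quad v > 0 .
\end{aligned}
\]

Finally,

\[
\mathbb{P}(\max[X,Y] < v) = \mathbb{P}(X < v)\,\mathbb{P}(Y < v) = \left(1 - e^{-\lambda v}\right)^2 .
\]

Then the PDF of the RV $\max[X,Y]$ is

\[
f_{\max[X,Y]}(v) = 2\left(1 - e^{-\lambda v}\right)\lambda e^{-\lambda v}
= 2\lambda e^{-2\lambda v}\left(e^{\lambda v} - 1\right),\quad v > 0 .
\]

We see that the two densities are identical.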
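
The coincidence of the two distributions in Worked Example 2.2.27 can be illustrated by comparing empirical CDFs with the exact CDF $(1 - e^{-\lambda v})^2$. This is a rough sketch under assumed names and parameters (the rate `lam`, the point `v`, the seed), not part of the original argument.

```python
import math
import random


def empirical_cdfs(v: float, lam: float = 1.0, trials: int = 50_000, seed: int = 1):
    """Estimate P(X + Y/2 <= v) and P(max(X, Y) <= v) for independent Exp(lam) X, Y."""
    rng = random.Random(seed)
    hits_sum = hits_max = 0
    for _ in range(trials):
        x = rng.expovariate(lam)
        y = rng.expovariate(lam)
        hits_sum += (x + 0.5 * y <= v)
        hits_max += (max(x, y) <= v)
    return hits_sum / trials, hits_max / trials


def exact_cdf(v: float, lam: float = 1.0) -> float:
    """CDF of max(X, Y), obtained above as (1 - e^{-lam v})^2."""
    return (1.0 - math.exp(-lam * v)) ** 2


print(empirical_cdfs(1.0), exact_cdf(1.0))
```

Both estimates should agree with the exact value up to Monte Carlo error.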


Problem 2.2.28 A radioactive source emits particles in a random direction (with all directions being equally likely). The source is held at a distance $d$ from a photographic plate which is an infinite vertical plane.


(i) Show that, given the particle hits the plate, the horizontal co-ordinate of the point of impact (with the point nearest the source as origin) has PDF $d/[\pi(d^2 + x^2)]$.

(ii) Can you compute the mean of this distribution? 

Hint: (i) Use spherical co-ordinates $r, \psi, \theta$. The particle hits the photographic plate if the ray along the direction of emission crosses the half-sphere touching the plate; the probability of this equals 1/2. The conditional distribution of the direction of emission is uniform on this half-sphere. Consequently, the angle $\psi$ such that $x/d = \tan\psi$ and $x/r = \sin\psi$ is uniformly distributed over $(0,\pi)$ and $\theta$ uniformly distributed over $(0, 2\pi)$. See Figure 2.20.

Figure 2.20


Problem 2.2.29 A shot is fired at a circular target. The vertical and 
horizontal co-ordinates of the point of impact (taking the centre of the target 
as origin) are independent RVs, each distributed normally N(0, 1). 


(i) Show that the distance of the point of impact from the centre has PDF $re^{-r^2/2}$ for $r > 0$.

(ii) Show that the mean of this distance is $\sqrt{\pi/2}$, the median is $\sqrt{\ln 4}$, and the mode is 1.


Hint: For part (i) see Problem 2.2.28. For part (ii) recall, for the median $\hat r$,

\[
\int_0^{\hat r} re^{-r^2/2}\,dr = \int_{\hat r}^{\infty} re^{-r^2/2}\,dr .
\]


Problem 2.2.30 The RV $X$ has a log-normal distribution if $Y = \ln X$ is normally distributed. If $Y \sim \mathrm{N}(\mu,\sigma^2)$, calculate the mean and variance of $X$. The log-normal distribution is sometimes used to represent the size of small particles after a crushing process, or as a model for future commodity prices. When a particle splits, a daughter might be some proportion of the size of the parent particle; when a price moves, it may move by a percentage.

Hint: $\mathbb{E}(X^r) = M_Y(r) = \exp(r\mu + r^2\sigma^2/2)$. For $r = 1, 2$ we immediately get $\mathbb{E}X$ and $\mathbb{E}X^2$.


2.3 Normal distribution. Convergence of random variables 
and distributions. The Central Limit Theorem 


Probabilists do it. After all, it’s only normal. 
(From the series ‘How they do it’.) 


We have already learned a number of properties of a normal distribution. Its
importance was realised at an early stage by, among others, Laplace, Poisson
and of course Gauss. However, progress in understanding the special nature
of normal distributions was steady and required facts and methods from
other fields of mathematics, including analysis and mathematical physics
(notably, complex analysis and partial differential equations). Nowadays, the
emphasis is on multivariate (multidimensional) normal distributions, which
play a fundamental rôle wherever probabilistic concepts are in use. Despite
an emerging variety of other exciting examples, they remain a firm basis from
which further development takes off. In particular, normal distributions form
the basis of statistics and financial mathematics.

Recall the properties of Gaussian distributions which we have established 
so far: 


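(i) The PDF of an $\mathrm{N}(\mu,\sigma^2)$ RV $X$ is

\[
\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right),\quad x \in \mathbb{R}, \tag{2.3.1}
\]

with the mean and variance

\[
\mathbb{E}X = \mu,\qquad \mathrm{Var}\,X = \sigma^2, \tag{2.3.2}
\]

and the MGF and CHF

\[
\mathbb{E}e^{\theta X} = e^{\theta\mu + \theta^2\sigma^2/2},\qquad
\mathbb{E}e^{itX} = e^{it\mu - t^2\sigma^2/2},\quad \theta, t \in \mathbb{R}. \tag{2.3.3}
\]

If $X \sim \mathrm{N}(\mu,\sigma^2)$, then $(X-\mu)/\sigma \sim \mathrm{N}(0,1)$ and for all $b, c \in \mathbb{R}$: $cX + b \sim \mathrm{N}(c\mu + b, c^2\sigma^2)$.

(ii) Two jointly normal RVs $X$ and $Y$ are independent if and only if $\mathrm{Cov}(X,Y) = \mathrm{Corr}(X,Y) = 0$.

(iii) The sum $X + Y$ of two jointly normal RVs $X \sim \mathrm{N}(\mu_1,\sigma_1^2)$ and $Y \sim \mathrm{N}(\mu_2,\sigma_2^2)$ with $\mathrm{Corr}(X,Y) = r$ is normal, with mean $\mu_1 + \mu_2$ and variance $\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2$. See equation (2.1.41). In particular, if $X, Y$ are independent, $X + Y \sim \mathrm{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$. In general, for independent RVs $X_1, X_2,\ldots$, where $X_i \sim \mathrm{N}(\mu_i,\sigma_i^2)$, the linear combination

\[
\sum_i c_iX_i \sim \mathrm{N}\left(\sum_i c_i\mu_i,\ \sum_i c_i^2\sigma_i^2\right).
\]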


Worked Example 2.3.1 Here we describe the famous algorithm for simulation of Gaussian RVs introduced by George Box and Mervin Muller in 1958. Let $(S,T)$ be uniformly distributed on $[-1,1]^2$ and define $R = \sqrt{S^2 + T^2}$. Show that, conditionally on

\[
R \le 1,
\]

the vector $(S,T)$ is uniformly distributed on the unit disc. Let $(R,\Theta)$ denote the point $(S,T)$ in polar co-ordinates and find its probability density function $f(r,\theta)$ for $r \in [0,1]$, $\theta \in [0,2\pi)$. Deduce that $R$ and $\Theta$ are independent.


Introduce the new RVs

\[
X = \frac{S}{R}\sqrt{-2\log(R^2)},\qquad Y = \frac{T}{R}\sqrt{-2\log(R^2)},
\]

noting that, under the above conditioning, $(S,T)$ are uniformly distributed on the unit disc. The pair $(X,Y)$ may be viewed as a (random) point in $\mathbb{R}^2$ with polar co-ordinates $(Q,\Psi)$. Express $Q$ as a function of $R$ and deduce its PDF. Find the joint PDF of $(Q,\Psi)$. Hence deduce that $X$ and $Y$ are independent normal RVs with zero mean and unit variance.


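Solution Let $D$ denote the unit disk $\{(s,t) : s^2 + t^2 \le 1\}$; it has area $|D| = \pi$. Given a subset $A \subset [-1,1]^2$, write

\[
\mathbb{P}\big((S,T) \in A \mid S^2 + T^2 \le 1\big)
= \frac{\mathbb{P}\big((S,T) \in A \cap D\big)}{\mathbb{P}\big((S,T) \in D\big)}
= \frac{1}{\pi}\iint \mathbf{1}\big((s,t) \in A \cap D\big)\,ds\,dt ;
\]

this implies the uniform distribution in $D$.


The last expression, in polar co-ordinates, reads

\[
\frac{1}{\pi}\iint_A \mathbf{1}(0 \le r \le 1)\,\mathbf{1}(0 \le \theta \le 2\pi)\,r\,dr\,d\theta .
\]

This yields that the joint PDF $f_{R,\Theta}(r,\theta)$ of the radius $R$ and the argument $\Theta$ equals $\mathbf{1}(0 \le r \le 1)\mathbf{1}(0 \le \theta \le 2\pi)\,r/\pi$. By writing

\[
f_{R,\Theta}(r,\theta) = 2r\,\mathbf{1}(0 \le r \le 1)\cdot\frac{1}{2\pi}\,\mathbf{1}(0 \le \theta \le 2\pi),
\]

we see that $f_{R,\Theta}(r,\theta)$ factorises as $f_R(r)f_\Theta(\theta)$. Hence, RVs $R$ and $\Theta$ are independent, and

\[
f_R(r) = 2r\,\mathbf{1}(0 \le r \le 1),\qquad f_\Theta(\theta) = \frac{1}{2\pi}\,\mathbf{1}(0 \le \theta \le 2\pi).
\]

The last equation means that $\Theta \sim \mathrm{U}(0,2\pi)$.
Next,

\[
Q^2 = X^2 + Y^2 = -2\log(R^2),\quad\text{and}\quad \Psi = \Theta,
\]

as $(X,Y)$ is a positive multiple of $(S,T)$. Thus, independence of $Q$ and $\Psi$ is inherited from that of $R$ and $\Theta$. The marginal PDF $f_Q$ is obtained from $f_R(r)$: with $r(q) = e^{-q^2/4}$,

\[
f_Q(q) = f_R(r(q))\left|\frac{dr(q)}{dq}\right|
= \mathbf{1}(q > 0)\,2e^{-q^2/4}\cdot\frac{q}{2}\,e^{-q^2/4}
= \mathbf{1}(q > 0)\,q\,e^{-q^2/2}.
\]

Then the joint PDF $f_{Q,\Psi}$ reads

\[
f_{Q,\Psi}(q,\psi) = q\,e^{-q^2/2}\,\mathbf{1}(q > 0)\times\frac{1}{2\pi}\,\mathbf{1}(0 \le \psi \le 2\pi).
\]

Passing back to the Cartesian co-ordinates, we get

\[
f_{X,Y}(x,y) = \frac{1}{2\pi}\,e^{-(x^2+y^2)/2}.
\]

Thus, $X$ and $Y$ are IID, with $X \sim Y \sim \mathrm{N}(0,1)$.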
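
The scheme of Worked Example 2.3.1 translates directly into a short routine. The following is a minimal sketch, assuming Python's standard `random` module as the source of uniform variates; the function names and the seed are illustrative and not taken from the text.

```python
import math
import random


def polar_pair(rng: random.Random):
    """Return one pair (X, Y) built from (S, T) uniform on [-1, 1]^2, conditioned on R <= 1."""
    while True:
        s = rng.uniform(-1.0, 1.0)
        t = rng.uniform(-1.0, 1.0)
        r2 = s * s + t * t                  # R^2
        if 0.0 < r2 <= 1.0:                 # reject points outside the unit disc (and R = 0)
            factor = math.sqrt(-2.0 * math.log(r2)) / math.sqrt(r2)
            return s * factor, t * factor   # X = (S/R) sqrt(-2 log R^2), Y likewise


def sample_moments(trials: int, seed: int = 0):
    """Sample mean of X, sample mean of X^2 and sample mean of XY."""
    rng = random.Random(seed)
    sx = sxx = sxy = 0.0
    for _ in range(trials):
        x, y = polar_pair(rng)
        sx += x
        sxx += x * x
        sxy += x * y
    return sx / trials, sxx / trials, sxy / trials


print(sample_moments(20_000))   # expected to be close to (0, 1, 0)
```

On average a fraction $\pi/4$ of the candidate points $(S,T)$ is accepted, which is the price paid for avoiding trigonometric functions.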


Remark 2.3.2 (Compare Worked Example 1.6.5.) Observe that knowing 
that the distribution is normal allows a much smaller sample size. 


As has been already said, the main fact justifying our interest in Gaussian 
distributions is that they appear in the celebrated Central Limit Theorem 
(CLT). The early version of this, the De Moivre-Laplace Theorem (DMLT), 
was established in Section 1.6. The statement of the DMLT can be extended 
to a general case of IID RVs. The following theorem was proved in 1900-1901 
by a Russian mathematician, A.M. Lyapunov (1857-1918). 

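Suppose $X_1, X_2,\ldots$ are IID RVs, with finite mean $\mathbb{E}X_j = a$ and variance $\mathrm{Var}\,X_j = \sigma^2$. If $S_n = X_1 + \cdots + X_n$, with $\mathbb{E}S_n = na$, $\mathrm{Var}\,S_n = n\sigma^2$, then for all $y \in \mathbb{R}$:

\[
\lim_{n\to\infty}\mathbb{P}\left(\frac{S_n - \mathbb{E}S_n}{\sqrt{\mathrm{Var}\,S_n}} < y\right) = \Phi(y). \tag{2.3.4}
\]

In fact, the convergence in equation (2.3.4) is uniform in $y$:

\[
\lim_{n\to\infty}\sup_{y\in\mathbb{R}}\left|\mathbb{P}\left(\frac{S_n - \mathbb{E}S_n}{\sqrt{\mathrm{Var}\,S_n}} < y\right) - \Phi(y)\right| = 0 .
\]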


Lyapunov and Markov were contemporaries and close friends. Lyapunov 
considered himself as Markov’s follower (although he was only a year 
younger). He made his name through Lyapunov functions, a concept that 
proved to be very useful in analysis of convergence to equilibrium in various 
random and deterministic systems. Lyapunov died tragically, committing 
suicide after the death of his beloved wife, amidst deprivation and terror 
during the civil war in Russia.

Mathematicians stand on each other’s shoulders.
K.-F. Gauss (1777-1855), German mathematician and physicist.


As in Section 1.6, the limiting relation (2.3.4) is commonly called the
integral CLT and often written as $(S_n - \mathbb{E}S_n)/\sqrt{\mathrm{Var}\,S_n} \sim \mathrm{N}(0,1)$. Here, the
CLT was stated for IID RVs, but modern methods can extend it to a much 
wider situation and provide an accurate bound on the speed of convergence. 


The proof of the integral CLT for general IID RVs requires special tech- 
niques. A popular method is based on characteristic functions and uses the 
following result that we will give here without proof. 

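Let $Y, Y_1, Y_2,\ldots$ be a sequence of RVs with distribution functions $F_Y, F_{Y_1}, F_{Y_2},\ldots$ and characteristic functions $\psi_Y, \psi_{Y_1}, \psi_{Y_2},\ldots$. Suppose that, as $n \to \infty$, CHF $\psi_{Y_n}(t) \to \psi_Y(t)$ for all $t \in \mathbb{R}$. Then $F_{Y_n}(y) \to F_Y(y)$ at every point $y \in \mathbb{R}$ where CDF $F_Y$ is continuous.

In our case, $Y \sim \mathrm{N}(0,1)$, with $F_Y = \Phi$, which is continuous everywhere on $\mathbb{R}$. Setting

\[
Y_n = \frac{S_n - an}{\sqrt{n}\,\sigma},
\]

we will have to check that the CHF $\psi_{Y_n}(t) \to e^{-t^2/2}$ for all $t \in \mathbb{R}$. This follows from a direct calculation which we perform below. Write

\[
\frac{S_n - an}{\sigma} = \sum_{1\le j\le n}\frac{X_j - a}{\sigma}
\]

and note that the RVs $(X_j - a)/\sigma$ are IID. Then

\[
\psi_{(S_n - an)/(\sqrt{n}\sigma)}(t) = \left[\psi_{(X_1 - a)/\sigma}\left(\frac{t}{\sqrt{n}}\right)\right]^n .
\]

The random variable $(X_j - a)/\sigma$ has mean 0 and variance 1. Hence its CHF admits the following Taylor expansion near 0:

\[
\psi(u) = \psi(0) + u\psi'(0) + \frac{u^2}{2}\psi''(0) + o(u^2) = 1 - \frac{u^2}{2} + o(u^2).
\]

(Here and below we omit the subscript $(X_j - a)/\sigma$.) This yields

\[
\psi\left(\frac{t}{\sqrt{n}}\right) = 1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right).
\]

But then

\[
\left[\psi\left(\frac{t}{\sqrt{n}}\right)\right]^n = \left[1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right)\right]^n \to e^{-t^2/2}. \tag{2.3.5}
\]

A short proof of equation (2.3.5) is to set

\[
A = A_n = 1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right),\qquad B = B_n = 1 - \frac{t^2}{2n},
\]

and observe that, clearly, $B^n \to e^{-t^2/2}$. Next,

\[
A^n - B^n = A^{n-1}(A - B) + A^{n-2}(A - B)B + \cdots + (A - B)B^{n-1},
\]

whence

\[
|A^n - B^n| \le \big(\max[1, |A|, |B|]\big)^{n-1}\,n\,|A - B|,
\]

which goes to 0 as $n \to \infty$.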
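
The integral CLT (2.3.4) can be illustrated numerically. The sketch below takes the summands uniform on $[0,1]$ (an arbitrary choice made here, with $a = 1/2$ and $\sigma^2 = 1/12$) and compares the empirical value of $\mathbb{P}\big((S_n - an)/(\sqrt{n}\sigma) < y\big)$ with $\Phi(y)$; all names and parameter values are illustrative assumptions.

```python
import math
import random


def std_normal_cdf(y: float) -> float:
    """Phi(y), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def empirical_clt_cdf(y: float, n: int, trials: int = 20_000, seed: int = 2) -> float:
    """Estimate P((S_n - n a) / (sqrt(n) sigma) < y) for X_j uniform on [0, 1]."""
    rng = random.Random(seed)
    a, sigma = 0.5, math.sqrt(1.0 / 12.0)
    count = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        if (s - n * a) / (math.sqrt(n) * sigma) < y:
            count += 1
    return count / trials


for y in (-1.0, 0.0, 1.0):
    print(y, empirical_clt_cdf(y, n=30, trials=5_000), std_normal_cdf(y))
```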


Worked Example 2.3.3 Every year, a major university assigns Class A to ~16 percent of its mathematics graduates, Class B and Class C each to ~34 percent and Class D or failure to the remaining 16 percent. The figures are repeated regardless of the variation in the actual performance in a given year.

A graduating student tries to make sense of such a practice. She assumes that the individual candidates’ scores $X_1,\ldots,X_n$ are independent RVs that differ only in mean values $\mathbb{E}X_j$, so that ‘centred’ scores $X_j - \mathbb{E}X_j$ have the same distribution. Next, she considers the average sample total score distribution as approximately $\mathrm{N}(\mu,\sigma^2)$. Her guess is that the above practice is related to a standard partition of students’ total score values into four categories. Class A is awarded when the score exceeds a certain limit, say $a$, Class B when it is between $b$ and $a$, Class C when it is between $c$ and $b$ and Class D or failure when it is lower than $c$. Obviously, the thresholds $c$, $b$ and $a$ may depend on $\mu$ and $\sigma$.

After a while (and using tables), she convinces herself that this is indeed the case and manages to find simple formulas giving reasonable approximations for $a$, $b$ and $c$. Can you reproduce her answer?


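Solution Let $X_j$ be the score of candidate $j$. Set

\[
S_n = \sum_{1\le j\le n} X_j \quad\text{(the total score)},
\]

\[
S_n - \mathbb{E}S_n = \sum_{1\le j\le n}(X_j - \mathbb{E}X_j) \quad\text{(the ‘centred’ total score)}.
\]

Assume that $X_j - \mathbb{E}X_j$ are IID RVs (which, in particular, means that $\mathrm{Var}\,X_j = \sigma^2$ does not depend on $j$ and $\mathrm{Var}\,S_n = n\sigma^2$). Then, owing to the CLT, for $n$ large,

\[
\frac{S_n - \mathbb{E}S_n}{\sqrt{n}\,\sigma} \sim \mathrm{N}(0,1).
\]

Thus the total average score $S_n/n$ must obey, for large $n$:

\[
\frac{S_n}{n} \sim \frac{\sigma}{\sqrt{n}}\,Y + \mu,
\]

where $Y \sim \mathrm{N}(0,1)$ and

\[
\mu = \frac{1}{n}\,\mathbb{E}S_n = \frac{1}{n}\sum_{1\le j\le n}\mathbb{E}X_j .
\]

We look for the value $\alpha$ such that $\mathbb{P}(Y > \alpha) = 1 - \Phi(\alpha) \approx 0.16$, which gives $\alpha = 1$. Clearly, $\mathbb{P}(Y > 1) = \mathbb{P}(S_n/n > \sigma/\sqrt{n} + \mu)$. Similarly, the equation $\mathbb{P}(\beta < Y < 1) = \Phi(1) - \Phi(\beta) \approx 0.34$ yields $\beta = 0$. A natural conjecture is that $a = \mu + \sigma$, $b = \mu$ and $c = \mu - \sigma$.

To give a justification for this guess, write $X_j \sim \mathbb{E}X_j + \sigma Y$. This implies

\[
\frac{S_n}{n} \sim \frac{1}{\sqrt{n}}(X_j - \mathbb{E}X_j) + \mu
\]

and

\[
\mathbb{P}\left(\frac{1}{\sqrt{n}}(X_j - \mathbb{E}X_j) + \mu > \frac{\sigma}{\sqrt{n}} + \mu\right) = \mathbb{P}(X_j > \mathbb{E}X_j + \sigma) = 0.16 .
\]

We do not know $\mu$ or $\sigma^2$ and use their estimates:

\[
\hat\mu = \frac{S_n}{n}\quad\text{and}\quad \hat{\sigma}^2 = \frac{1}{n-1}\sum_{1\le j\le n}\left(X_j - \frac{S_n}{n}\right)^2
\]

(see Worked Example 1.4.8). Then the categories are defined in the following way: Class A is given when the candidate’s score exceeds $\hat\mu + \sqrt{\hat\sigma^2}$, Class B when it is between $\hat\mu$ and $\hat\mu + \sqrt{\hat\sigma^2}$, Class C when it is between $\hat\mu - \sqrt{\hat\sigma^2}$ and $\hat\mu$ and Class D or failure when it is less than $\hat\mu - \sqrt{\hat\sigma^2}$.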
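
The percentages 16/34/34/16 correspond to the standard normal probabilities of the intervals cut at $-1$, $0$ and $1$. A small sketch, assuming Python's `statistics.NormalDist` for $\Phi$ and a hypothetical helper `grade` that applies the thresholds $\hat\mu \pm \hat\sigma$:

```python
from statistics import NormalDist

std = NormalDist()                           # standard normal N(0, 1)
share_a = 1.0 - std.cdf(1.0)                 # P(Y > 1), about 0.16
share_b = std.cdf(1.0) - std.cdf(0.0)        # P(0 < Y < 1), about 0.34
print(round(share_a, 4), round(share_b, 4))


def grade(score: float, mu_hat: float, sigma_hat: float) -> str:
    """Hypothetical classifier using the thresholds mu_hat + sigma_hat, mu_hat, mu_hat - sigma_hat."""
    if score > mu_hat + sigma_hat:
        return "A"
    if score > mu_hat:
        return "B"
    if score > mu_hat - sigma_hat:
        return "C"
    return "D"


print([grade(s, 60.0, 10.0) for s in (75.0, 65.0, 55.0, 45.0)])
```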


Worked Example 2.3.4 (Continuation of Worked Example 2.3.3.) Now suppose one wants to assess how accurate the approximation of the average expected score $\mu$ by $S_n/n$ is. Assuming that $\sigma$, the standard deviation of the individual score $X_j$, is $\le 10$ mark units, how large should $n$ be to guarantee that the probability of a deviation of $S_n/n$ from $\mu$ of more than 5 marks does not exceed 0.01?




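Solution We want

\[
\mathbb{P}\left(\left|\frac{S_n}{n} - \mu\right| > 5\right) \le 0.01 .
\]

Letting, as before, $Y \sim \mathrm{N}(0,1)$, the CLT yields that the last probability is

\[
\approx \mathbb{P}\left(|Y| > \frac{5\sqrt{n}}{\sigma}\right).
\]

Thus, we want

\[
\frac{5\sqrt{n}}{\sigma} \ge \Phi^{-1}(0.995),\quad\text{i.e.}\quad n \ge \left[\Phi^{-1}(0.995)\right]^2\frac{\sigma^2}{25},
\]

with $\sigma^2 \le 100$. Here, and below, $\Phi^{-1}$ stands for the inverse of $\Phi$. As $\Phi^{-1}(0.995) = 2.58$, we have, in the worst case, that

\[
n = \frac{100}{25}\left[\Phi^{-1}(0.995)\right]^2 = 26.63
\]

will suffice.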
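
The bound $n \ge [\Phi^{-1}(0.995)]^2\sigma^2/25$ can be evaluated without tables. A minimal sketch, assuming `statistics.NormalDist.inv_cdf` for $\Phi^{-1}$; the function name and argument names are illustrative.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma_max: float, deviation: float, prob: float) -> int:
    """Smallest integer n with deviation * sqrt(n) / sigma_max >= Phi^{-1}(1 - prob / 2)."""
    z = NormalDist().inv_cdf(1.0 - prob / 2.0)
    return math.ceil((z * sigma_max / deviation) ** 2)


print(required_sample_size(sigma_max=10.0, deviation=5.0, prob=0.01))
```

With the more precise quantile $\Phi^{-1}(0.995) \approx 2.576$ the bound is about 26.5, so the smallest admissible integer is the same as with the rounded value 2.58.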


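In the calculations below it will be convenient to use an alternative notation for the scalar product:

\[
(\mathbf{x} - \boldsymbol\mu)^{\mathrm T}\Sigma^{-1}(\mathbf{x} - \boldsymbol\mu) = \big\langle \mathbf{x} - \boldsymbol\mu,\ \Sigma^{-1}(\mathbf{x} - \boldsymbol\mu)\big\rangle .
\]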


Worked Example 2.3.5 Suppose $X_1,\ldots,X_n$ are independent RVs, $X_j \sim \mathrm{N}(\mu_j,\sigma^2)$, with the same variance. Consider RVs $Y_1,\ldots,Y_n$ given by

\[
Y_j = \sum_{i=1}^{n} a_{ij}X_i,\quad j = 1,\ldots,n,
\]

where $A = (a_{ij})$ is an $n \times n$ real orthogonal matrix. Prove that $Y_1,\ldots,Y_n$ are independent and determine their distributions.

Comment on the case where the variances $\sigma_1^2,\ldots,\sigma_n^2$ are different.

(An $n \times n$ real matrix $A$ is called orthogonal if $A = (A^{-1})^{\mathrm T}$.)
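Solution In vector notation,

\[
\mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} = A^{\mathrm T}\mathbf{X},\quad\text{where}\quad \mathbf{X} = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}.
\]

We have $A^{\mathrm T} = A^{-1}$, $(A^{\mathrm T})^{-1} = A$. Moreover, the Jacobians of the mutually inverse linear maps,

\[
\mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \mapsto \mathbf{y} = A^{\mathrm T}\mathbf{x},\qquad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \mapsto \mathbf{x} = A\mathbf{y},\quad \mathbf{x},\mathbf{y} \in \mathbb{R}^n,
\]

are equal to $\pm 1$ (and equal each other). In fact,

\[
\frac{\partial(y_1,\ldots,y_n)}{\partial(x_1,\ldots,x_n)} = \det A^{\mathrm T},\qquad
\frac{\partial(x_1,\ldots,x_n)}{\partial(y_1,\ldots,y_n)} = \det A,
\]

and $\det A^{\mathrm T} = \det A = \pm 1$.
The PDF $f_{Y_1,\ldots,Y_n}(\mathbf{y})$ equals $f_{X_1,\ldots,X_n}(A\mathbf{y})$, which in turn is equal to

\[
\left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\prod_{1\le j\le n}\exp\left[-\frac{1}{2\sigma^2}\Big(\sum_{1\le i\le n} a_{ji}y_i - \mu_j\Big)^2\right].
\]

Set

\[
\boldsymbol\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix}\quad\text{and}\quad
\sigma^{-2}I = \begin{pmatrix} \sigma^{-2} & 0 & \cdots & 0 \\ 0 & \sigma^{-2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^{-2} \end{pmatrix},
\]

where $I$ is the $n \times n$ unit matrix. Writing $(A\mathbf{y})_j$ for the $j$th entry $\sum_{1\le i\le n} a_{ji}y_i$ of vector $A\mathbf{y}$, the above product of exponentials becomes

\[
\begin{aligned}
\exp\left[-\frac{1}{2\sigma^2}\sum_{1\le j\le n}\big((A\mathbf{y})_j - \mu_j\big)^2\right]
&= \exp\left[-\frac{1}{2}(A\mathbf{y} - \boldsymbol\mu)^{\mathrm T}(\sigma^{-2}I)(A\mathbf{y} - \boldsymbol\mu)\right] \\
&= \exp\left[-\frac{1}{2}(\mathbf{y} - A^{\mathrm T}\boldsymbol\mu)^{\mathrm T}A^{\mathrm T}(\sigma^{-2}I)A(\mathbf{y} - A^{\mathrm T}\boldsymbol\mu)\right].
\end{aligned}
\]

The triple matrix product $A^{\mathrm T}(\sigma^{-2}I)A = \sigma^{-2}A^{\mathrm T}A = \sigma^{-2}I$. Hence, the last expression is equal to

\[
\begin{aligned}
\exp\left[-\frac{1}{2}(\mathbf{y} - A^{\mathrm T}\boldsymbol\mu)^{\mathrm T}\sigma^{-2}I\,(\mathbf{y} - A^{\mathrm T}\boldsymbol\mu)\right]
&= \exp\left[-\frac{1}{2\sigma^2}\sum_{1\le j\le n}\big(y_j - (A^{\mathrm T}\boldsymbol\mu)_j\big)^2\right] \\
&= \prod_{1\le j\le n}\exp\left[-\frac{1}{2\sigma^2}\big(y_j - (A^{\mathrm T}\boldsymbol\mu)_j\big)^2\right].
\end{aligned}
\]

Here, as before, $(A^{\mathrm T}\boldsymbol\mu)_j = \sum_{1\le i\le n} a_{ij}\mu_i$ stands for the $j$th entry of vector $A^{\mathrm T}\boldsymbol\mu$. So, $Y_1,\ldots,Y_n$ are independent, and $Y_j \sim \mathrm{N}\big((A^{\mathrm T}\boldsymbol\mu)_j,\sigma^2\big)$.

In the general case where variances $\sigma_j^2$ are different, matrix $\sigma^{-2}I$ must be replaced by the diagonal matrix

\[
\Sigma^{-1} = \begin{pmatrix} \sigma_1^{-2} & 0 & \cdots & 0 \\ 0 & \sigma_2^{-2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^{-2} \end{pmatrix}. \tag{2.3.6}
\]

Random variables $Y_1,\ldots,Y_n$ will be independent if and only if the matrix

\[
A^{\mathrm T}\Sigma^{-1}A
\]

is diagonal. For instance, if $A$ commutes with $\Sigma^{-1}$, i.e. $\Sigma^{-1}A = A\Sigma^{-1}$, then

\[
A^{\mathrm T}\Sigma^{-1}A = A^{\mathrm T}A\,\Sigma^{-1} = \Sigma^{-1},
\]

in which case $Y_j \sim \mathrm{N}\big((A^{\mathrm T}\boldsymbol\mu)_j,\sigma_j^2\big)$, with the same variance.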
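
The equal-variance case of Worked Example 2.3.5 can be illustrated for $n = 2$, taking $A$ to be a rotation matrix (one particular orthogonal matrix, chosen here for convenience). The sketch below is an assumption-laden illustration: the angle, means, variance, seed and helper names are all made up for the example.

```python
import math
import random


def rotated_sample(theta, mu, sigma, trials=20_000, seed=3):
    """Sample Y = A^T X, where A = [[c, -s], [s, c]] and X_1, X_2 are independent N(mu_i, sigma^2)."""
    rng = random.Random(seed)
    c, s = math.cos(theta), math.sin(theta)
    pairs = []
    for _ in range(trials):
        x1 = rng.gauss(mu[0], sigma)
        x2 = rng.gauss(mu[1], sigma)
        # Y_j = sum_i a_ij X_i, i.e. Y_1 = c X_1 + s X_2 and Y_2 = -s X_1 + c X_2
        pairs.append((c * x1 + s * x2, -s * x1 + c * x2))
    return pairs


def sample_moments(pairs):
    """Sample means, variances and covariance of a list of pairs."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    v1 = sum((p[0] - m1) ** 2 for p in pairs) / n
    v2 = sum((p[1] - m2) ** 2 for p in pairs) / n
    cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / n
    return m1, m2, v1, v2, cov


print(sample_moments(rotated_sample(0.7, (1.0, -2.0), 2.0)))
```

The sample covariance should be close to 0 and both sample variances close to $\sigma^2 = 4$, while the sample means approximate the entries of $A^{\mathrm T}\boldsymbol\mu$.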


Worked Example 2.3.5 leads us to the general statement that multivariate normal RVs $X_1,\ldots,X_n$ are independent if and only if $\mathrm{Cov}(X_i,X_j) = 0$ for all $1 \le i < j \le n$. This is a part of property (iv) stated below. So far we have established the fact that bivariate normal RVs $X, Y$ are independent if and only if $\mathrm{Cov}(X,Y) = 0$ (see equation (2.1.35)).

Recall (see equation (2.1.9)) that a general multivariate normal vector

\[
\mathbf{X} = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}
\]

has the PDF $f_{\mathbf X}(\mathbf{x})$ of the form

\[
f_{\mathbf X}(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^n(\det\Sigma)^{1/2}}\exp\left(-\frac{1}{2}\big\langle \mathbf{x} - \boldsymbol\mu,\ \Sigma^{-1}(\mathbf{x} - \boldsymbol\mu)\big\rangle\right), \tag{2.3.7}
\]

where $\Sigma$ is an invertible positive-definite (and hence symmetric) $n \times n$ real matrix, and

\[
\det\Sigma^{-1} = (\det\Sigma)^{-1}.
\]

For a multivariate normal vector $\mathbf{X}$ we will write $\mathbf{X} \sim \mathrm{N}(\boldsymbol\mu,\Sigma)$.

Following the properties (i)-(iii) at the beginning of the current section, the next properties of the Gaussian distribution we are going to establish are (iva) and (ivb):


(iva) If $\mathbf{X} \sim \mathrm{N}(\boldsymbol\mu,\Sigma)$, with PDF $f_{\mathbf X}$ as in formula (2.3.7), then each $X_i \sim \mathrm{N}(\mu_i,\Sigma_{ii})$, $i = 1,\ldots,n$, with mean $\mu_i$ and variance $\Sigma_{ii}$, the diagonal element of matrix $\Sigma$.


(ivb) The off-diagonal element $\Sigma_{ij}$ equals the covariance $\mathrm{Cov}(X_i,X_j)$, for all $1 \le i < j \le n$. So, the matrices $\Sigma$ and $\Sigma^{-1}$ are diagonal and therefore the PDF $f_{\mathbf X}(\mathbf{x})$ decomposes into the product if and only if $\mathrm{Cov}(X_i,X_j) = 0$, $1 \le i < j \le n$. In other words, jointly normal RVs $X_1,\ldots,X_n$ are independent if and only if $\mathrm{Cov}(X_i,X_j) = 0$, $1 \le i < j \le n$.


Naturally, the vector $\boldsymbol\mu$ is called the mean-value vector and the matrix $\Sigma$ the covariance matrix of a multivariate random vector $\mathbf{X}$.

The proof of assertions (iva) and (ivb) uses directly the form (2.3.7) of 
the joint multivariate normal PDF fx(x). First, we discuss some algebraic 
preliminaries. (This actually will provide us with more properties of a mul- 
tivariate normal distribution.) 

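It was stated in Section 2.1 that as a positive-definite matrix, $\Sigma$ has a diagonal form (in the basis formed by its eigenvectors). That is, there exists an orthogonal matrix $B = (b_{ij})$, with $B = (B^{-1})^{\mathrm T}$ such that $BDB^{-1} = \Sigma$ and $BD^{-1}B^{-1} = \Sigma^{-1}$, where

\[
D = \begin{pmatrix} \lambda_1^2 & 0 & \cdots & 0 \\ 0 & \lambda_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n^2 \end{pmatrix},\qquad
D^{-1} = \begin{pmatrix} \lambda_1^{-2} & 0 & \cdots & 0 \\ 0 & \lambda_2^{-2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n^{-2} \end{pmatrix},
\]

and $\lambda_1^2, \lambda_2^2,\ldots,\lambda_n^2$ are (positive) eigenvalues of $\Sigma$. Note that $\det\Sigma = \prod_{i=1}^n\lambda_i^2$ and $(\det\Sigma)^{1/2} = \prod_{i=1}^n\lambda_i$.

If we make the orthogonal change of variables $\mathbf{x} \mapsto \mathbf{y} = B^{\mathrm T}\mathbf{x}$ with the inverse map $\mathbf{y} \mapsto \mathbf{x} = B\mathbf{y}$ and the Jacobian $\partial(x_1,\ldots,x_n)/\partial(y_1,\ldots,y_n) = \det B = \pm 1$, the joint PDF of the new RVs

\[
\mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} = B^{\mathrm T}\mathbf{X}
\]

is $f_{\mathbf Y}(\mathbf{y}) = f_{\mathbf X}(B\mathbf{y})$. More precisely,

\[
\begin{aligned}
f_{\mathbf Y}(\mathbf{y})
&= \left(\prod_{1\le i\le n}\frac{1}{\sqrt{2\pi}\,\lambda_i}\right)\exp\left[-\frac{1}{2}(B\mathbf{y} - \boldsymbol\mu)^{\mathrm T}\Sigma^{-1}(B\mathbf{y} - \boldsymbol\mu)\right] \\
&= \left(\prod_{1\le i\le n}\frac{1}{\sqrt{2\pi}\,\lambda_i}\right)\exp\left[-\frac{1}{2}(\mathbf{y} - B^{\mathrm T}\boldsymbol\mu)^{\mathrm T}B^{\mathrm T}BD^{-1}B^{\mathrm T}B(\mathbf{y} - B^{\mathrm T}\boldsymbol\mu)\right] \\
&= \left(\prod_{1\le i\le n}\frac{1}{\sqrt{2\pi}\,\lambda_i}\right)\exp\left[-\frac{1}{2}(\mathbf{y} - B^{\mathrm T}\boldsymbol\mu)^{\mathrm T}D^{-1}(\mathbf{y} - B^{\mathrm T}\boldsymbol\mu)\right] \\
&= \prod_{1\le i\le n}\frac{1}{\sqrt{2\pi}\,\lambda_i}\exp\left[-\frac{1}{2\lambda_i^2}\big(y_i - (B^{\mathrm T}\boldsymbol\mu)_i\big)^2\right].
\end{aligned}
\]

Here, $(B^{\mathrm T}\boldsymbol\mu)_i$ is the $i$th component of the vector $B^{\mathrm T}\boldsymbol\mu$. We see that the RVs $Y_1,\ldots,Y_n$ are independent and $Y_i \sim \mathrm{N}\big((B^{\mathrm T}\boldsymbol\mu)_i,\lambda_i^2\big)$. That is, the covariance

\[
\mathrm{Cov}(Y_i,Y_j) = \mathbb{E}\big[Y_i - (B^{\mathrm T}\boldsymbol\mu)_i\big]\big[Y_j - (B^{\mathrm T}\boldsymbol\mu)_j\big] =
\begin{cases} 0, & i \ne j, \\ \lambda_j^2, & i = j. \end{cases}
\]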


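Actually, the variables $Y_i$ will help us to do calculations with the RVs $X_j$. For example, for the mean value of $X_j$:

\[
\begin{aligned}
\mathbb{E}X_j &= \int_{\mathbb{R}^n}\frac{x_j}{(2\pi)^{n/2}\sqrt{\det\Sigma}}\exp\left[-\frac{1}{2}\big\langle \mathbf{x} - \boldsymbol\mu,\ \Sigma^{-1}(\mathbf{x} - \boldsymbol\mu)\big\rangle\right]d\mathbf{x} \\
&= \int_{\mathbb{R}^n}\frac{(B\mathbf{y})_j}{(2\pi)^{n/2}\sqrt{\det\Sigma}}\exp\left[-\frac{1}{2}\big\langle \mathbf{y} - B^{\mathrm T}\boldsymbol\mu,\ D^{-1}(\mathbf{y} - B^{\mathrm T}\boldsymbol\mu)\big\rangle\right]d\mathbf{y} \\
&= \sum_{1\le i\le n}\int_{\mathbb{R}^n}\frac{y_ib_{ji}}{(2\pi)^{n/2}\sqrt{\det\Sigma}}\exp\left[-\frac{1}{2}\big\langle \mathbf{y} - B^{\mathrm T}\boldsymbol\mu,\ D^{-1}(\mathbf{y} - B^{\mathrm T}\boldsymbol\mu)\big\rangle\right]d\mathbf{y} \\
&= \sum_{1\le i\le n}\frac{b_{ji}}{(2\pi)^{1/2}\lambda_i}\int_{\mathbb{R}} y\exp\left[-\frac{1}{2\lambda_i^2}\big(y - (B^{\mathrm T}\boldsymbol\mu)_i\big)^2\right]dy \\
&= \sum_{1\le i\le n}(B^{\mathrm T}\boldsymbol\mu)_i\,b_{ji} = \mu_j .
\end{aligned}
\]

Similarly, for the covariance $\mathrm{Cov}(X_i,X_j)$:

\[
\begin{aligned}
\mathrm{Cov}(X_i,X_j) &= \int_{\mathbb{R}^n}\frac{(x_i - \mu_i)(x_j - \mu_j)}{(2\pi)^{n/2}\sqrt{\det\Sigma}}\exp\left[-\frac{1}{2}\big\langle \mathbf{x} - \boldsymbol\mu,\ \Sigma^{-1}(\mathbf{x} - \boldsymbol\mu)\big\rangle\right]d\mathbf{x} \\
&= \int_{\mathbb{R}^n}\frac{\big[B(\mathbf{y} - B^{\mathrm T}\boldsymbol\mu)\big]_i\big[B(\mathbf{y} - B^{\mathrm T}\boldsymbol\mu)\big]_j}{(2\pi)^{n/2}\sqrt{\det\Sigma}}
\exp\left[-\frac{1}{2}\big\langle \mathbf{y} - B^{\mathrm T}\boldsymbol\mu,\ D^{-1}(\mathbf{y} - B^{\mathrm T}\boldsymbol\mu)\big\rangle\right]d\mathbf{y} \\
&= \sum_{1\le l\le n}\sum_{1\le m\le n}\mathrm{Cov}(Y_l,Y_m)\,b_{il}b_{jm} \\
&= \sum_{1\le m\le n} b_{im}\lambda_m^2 b_{jm} = (BDB^{-1})_{ij} = \Sigma_{ij}.
\end{aligned}
\]

This proves assertion (ivb). For $i = j$ it gives the variance $\mathrm{Var}\,X_i$.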
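In fact, an even more powerful tool is the joint CHF $\psi_{\mathbf X}(\mathbf{t})$ defined by

\[
\psi_{\mathbf X}(\mathbf{t}) = \mathbb{E}e^{i\langle \mathbf{t},\mathbf{X}\rangle} = \mathbb{E}\exp\left[i\sum_{j=1}^{n} t_jX_j\right],\quad \mathbf{t}^{\mathrm T} = (t_1,\ldots,t_n) \in \mathbb{R}^n. \tag{2.3.8}
\]

The joint CHF has many features of a marginal CHF. In particular, it determines the joint distribution of a random vector $\mathbf{X}$ uniquely. For multivariate normal vector $\mathbf{X}$ the joint CHF can be calculated explicitly. Indeed,

\[
\mathbb{E}e^{i\mathbf{t}^{\mathrm T}\mathbf{X}} = \mathbb{E}e^{i\mathbf{t}^{\mathrm T}B\mathbf{Y}} = \mathbb{E}e^{i(B^{\mathrm T}\mathbf{t})^{\mathrm T}\mathbf{Y}} = \prod_{j=1}^{n}\mathbb{E}\exp\left[i(B^{\mathrm T}\mathbf{t})_jY_j\right].
\]

Here each factor is a marginal CHF:

\[
\mathbb{E}\exp\left[i(B^{\mathrm T}\mathbf{t})_jY_j\right] = \exp\left[i(B^{\mathrm T}\mathbf{t})_j(B^{\mathrm T}\boldsymbol\mu)_j - (B^{\mathrm T}\mathbf{t})_j^2\lambda_j^2/2\right].
\]

As $BDB^{-1} = \Sigma$, the whole product equals

\[
\exp\left(i\mathbf{t}^{\mathrm T}BB^{-1}\boldsymbol\mu - \frac{1}{2}\mathbf{t}^{\mathrm T}BDB^{-1}\mathbf{t}\right) = \exp\left(i\mathbf{t}^{\mathrm T}\boldsymbol\mu - \frac{1}{2}\mathbf{t}^{\mathrm T}\Sigma\mathbf{t}\right).
\]

Hence,

\[
\psi_{\mathbf X}(\mathbf{t}) = e^{i\mathbf{t}^{\mathrm T}\boldsymbol\mu - \mathbf{t}^{\mathrm T}\Sigma\mathbf{t}/2}. \tag{2.3.9}
\]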


Note a distinctive difference in the matrices in the expressions for the multivariate normal PDF and CHF: formula (2.3.9) has $\Sigma$, the covariance matrix, and equation (2.3.7) has $\Sigma^{-1}$.

Now, to obtain the marginal CHF $\psi_{X_j}(t)$, we substitute vector $\mathbf{t} = (0,\ldots,0,t,0,\ldots,0)$ ($t$ in position $j$) into the RHS of equation (2.3.9):

\[
\psi_{X_j}(t) = \exp\left(i\mu_jt - t^2\Sigma_{jj}/2\right).
\]

Owing to the uniqueness of a PDF with a given CHF, $X_j \sim \mathrm{N}(\mu_j,\Sigma_{jj})$, as claimed in assertion (iva).
As a by-product of the above argument, we immediately establish that:


(ivc) If $\mathbf{X} \sim \mathrm{N}(\boldsymbol\mu,\Sigma)$, then any subcollection $(X_{j_l})$ is also jointly normal, with the mean vector $(\mu_{j_l})$ and covariance matrix $(\Sigma_{j_lj_{l'}})$.


For characteristic functions we obtained the following property:


(v) The joint CHF $\psi_{\mathbf X}(\mathbf{t})$ of a random vector $\mathbf{X} \sim \mathrm{N}(\boldsymbol\mu,\Sigma)$ is of the form

\[
\psi_{\mathbf X}(\mathbf{t}) = \exp\left(i\langle \mathbf{t},\boldsymbol\mu\rangle - \frac{1}{2}\langle \mathbf{t},\Sigma\mathbf{t}\rangle\right).
\]

Finally, the tools and concepts developed so far also allow us to check that:


(vi) A linear combination $\sum_{i=1}^{n} c_iX_i$ of jointly normal RVs, with $\mathbf{X} \sim \mathrm{N}(\boldsymbol\mu,\Sigma)$, is normal, with mean $\langle \mathbf{c},\boldsymbol\mu\rangle = \sum_{1\le i\le n} c_i\mu_i$ and variance $\langle \mathbf{c},\Sigma\mathbf{c}\rangle = \sum_{1\le i,j\le n} c_i\Sigma_{ij}c_j$. More generally, if $\mathbf{Y}$ is the random vector of the form $\mathbf{Y} = A^{\mathrm T}\mathbf{X}$ obtained from $\mathbf{X}$ by a linear change of variables with invertible matrix $A$, then $\mathbf{Y} \sim \mathrm{N}(A^{\mathrm T}\boldsymbol\mu, A^{\mathrm T}\Sigma A)$. See Example 3.1.1. (The last fact can also be extended to the case of a non-invertible $A$ but we will leave this subject to later volumes.)


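Worked Example 2.3.6 Derive the distribution of the sum of $n$ independent RVs, each having the Poisson distribution with parameter $\lambda$. Use the CLT to prove that

\[
e^{-n}\left(1 + \frac{n}{1!} + \frac{n^2}{2!} + \cdots + \frac{n^n}{n!}\right) \to \frac{1}{2}
\]

as $n \to \infty$.


Solution Let $X_1,\ldots,X_n$ be IID Po(1). The PGF

\[
\phi_{X_i}(s) = \mathbb{E}s^{X_i} = \sum_{l\ge 0}\frac{s^l}{l!}e^{-1} = e^{s-1},\quad s \in \mathbb{R}^1 .
\]

Now if $S_n = X_1 + \cdots + X_n$, then

\[
\phi_{S_n}(s) = \phi_{X_1}(s)\cdots\phi_{X_n}(s) = e^{n(s-1)},
\]

yielding that $S_n \sim \mathrm{Po}(n)$, with $\mathbb{E}S_n = \mathrm{Var}(S_n) = n$. (For a general parameter $\lambda$ the same calculation gives $S_n \sim \mathrm{Po}(n\lambda)$.) By the CLT,

\[
T_n = \frac{S_n - n}{\sqrt{n}} \sim \mathrm{N}(0,1)\quad\text{for } n \text{ large}.
\]

But

\[
e^{-n}\left(1 + n + \frac{n^2}{2!} + \cdots + \frac{n^n}{n!}\right) = \mathbb{P}(S_n \le n) = \mathbb{P}(T_n \le 0)
\to \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0} e^{-y^2/2}\,dy = \frac{1}{2}\quad\text{as } n \to \infty .
\]
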
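
The convergence in Worked Example 2.3.6 is slow but easy to observe numerically. The sketch below evaluates $e^{-n}\sum_{k=0}^{n} n^k/k!$ in log-space (a choice made here to avoid overflow for large $n$); the function name is an illustrative assumption.

```python
import math


def poisson_cdf_at_mean(n: int) -> float:
    """e^{-n} * sum_{k=0}^{n} n^k / k!, i.e. P(S_n <= n) for S_n ~ Po(n)."""
    log_n = math.log(n)
    # Each term is exp(-n + k log n - log k!), computed via lgamma to avoid overflow.
    return sum(math.exp(-n + k * log_n - math.lgamma(k + 1)) for k in range(n + 1))


for n in (10, 100, 1000, 10000):
    print(n, poisson_cdf_at_mean(n))
```

The values decrease towards 1/2 as $n$ grows.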
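
Worked Example 2.3.7 An algebraic question. If $\mathbf{x}_1,\ldots,\mathbf{x}_n \in \mathbb{R}^n$ are linearly independent column vectors, show that the $n \times n$ matrix

\[
\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{\mathrm T}
\]

is invertible.


Solution It is sufficient to show that the matrix $\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{\mathrm T}$ does not send any non-zero vector to zero. Hence, assume that

\[
\left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{\mathrm T}\right)\mathbf{c} = 0,\quad\text{i.e.}\quad
\sum_{k=1}^{n}\sum_{i=1}^{n} x_{ij}x_{ik}c_k = \sum_{i=1}^{n} x_{ij}\langle \mathbf{x}_i,\mathbf{c}\rangle = 0,\quad 1 \le j \le n,
\]

where

\[
\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ \vdots \\ x_{in} \end{pmatrix},\ 1 \le i \le n,\quad\text{and}\quad \mathbf{c} = \begin{pmatrix} c_1 \\ \vdots \\ c_n \end{pmatrix}.
\]

The last equation means that the linear combination

\[
\sum_{i=1}^{n}\mathbf{x}_i\langle \mathbf{x}_i,\mathbf{c}\rangle = 0 .
\]

Since the $\mathbf{x}_i$ are linearly independent, the coefficients $\langle \mathbf{x}_i,\mathbf{c}\rangle = 0$, $1 \le i \le n$. But this means that $\mathbf{c} = 0$, as the vectors $\mathbf{x}_1,\ldots,\mathbf{x}_n$ span $\mathbb{R}^n$.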


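A couple of the following problems have been borrowed from advanced statistical courses; they may be omitted at the first reading, but are useful for those readers who aim to achieve better understanding at this stage.

Worked Example 2.3.8 Let $X_1,\ldots,X_n$ be an IID sample from normal distribution $\mathrm{N}(\mu_1,\sigma_1^2)$ and $Y_1,\ldots,Y_m$ be an IID sample from normal distribution $\mathrm{N}(\mu_2,\sigma_2^2)$. Assume the two samples are independent of each other. What is the joint distribution of

\[
\text{the difference } \bar X - \bar Y \text{ and the sum } \frac{1}{\sigma_1^2}\sum_{i=1}^{n}X_i + \frac{1}{\sigma_2^2}\sum_{j=1}^{m}Y_j\,?
\]

Solution We have $\bar X \sim \mathrm{N}(\mu_1,\sigma_1^2/n)$ and $\bar Y \sim \mathrm{N}(\mu_2,\sigma_2^2/m)$. Further,

\[
f_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y}) = \left(\frac{1}{2\pi\sigma_1^2}\right)^{n/2}\left(\frac{1}{2\pi\sigma_2^2}\right)^{m/2}
\exp\left[-\sum_{i=1}^{n}\frac{(x_i-\mu_1)^2}{2\sigma_1^2} - \sum_{j=1}^{m}\frac{(y_j-\mu_2)^2}{2\sigma_2^2}\right].
\]

We see that both

\[
U = \bar X - \bar Y\quad\text{and}\quad S = \frac{1}{\sigma_1^2}\sum_{i=1}^{n}X_i + \frac{1}{\sigma_2^2}\sum_{j=1}^{m}Y_j
\]

are linear combinations of (independent) normal RVs and hence are normal. A straightforward calculation shows that

\[
\mathbb{E}U = \mu_1 - \mu_2,\qquad \mathbb{E}S = \frac{n\mu_1}{\sigma_1^2} + \frac{m\mu_2}{\sigma_2^2},
\]

and

\[
\mathrm{Var}\,U = \frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m},\qquad \mathrm{Var}\,S = \frac{n}{\sigma_1^2} + \frac{m}{\sigma_2^2}.
\]

So

\[
f_U(u) = \frac{(mn)^{1/2}}{[2\pi(m\sigma_1^2 + n\sigma_2^2)]^{1/2}}\exp\left\{-\frac{[u - (\mu_1-\mu_2)]^2}{2(m\sigma_1^2 + n\sigma_2^2)/(mn)}\right\},\quad u \in \mathbb{R},
\]

and

\[
f_S(s) = \frac{\sigma_1\sigma_2}{[2\pi(m\sigma_1^2 + n\sigma_2^2)]^{1/2}}\exp\left\{-\frac{\left[s - \left(n\mu_1/\sigma_1^2 + m\mu_2/\sigma_2^2\right)\right]^2}{2(m\sigma_1^2 + n\sigma_2^2)/(\sigma_1^2\sigma_2^2)}\right\},\quad s \in \mathbb{R}.
\]

Finally, $U$ and $S$ are independent as they are jointly normal and have $\mathrm{Cov}(U,S) = 0$. Therefore, the joint PDF $f_{U,S}(u,s) = f_U(u)f_S(s)$.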


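Remark 2.3.9 The formulas for $f_U(u)$ and $f_S(s)$ imply that the pair $(U,S)$ forms the so-called ‘sufficient statistic’ for the pair of unknown parameters $(\mu_1,\mu_2)$; see Section 3.3.

Worked Example 2.3.10 Let $X_1,\ldots,X_n$ be a random sample from the $\mathrm{N}(\theta,\sigma^2)$ distribution, and suppose that the prior distribution for $\theta$ is the $\mathrm{N}(\mu,\tau^2)$ distribution, where $\sigma^2$, $\mu$ and $\tau^2$ are known. Determine the posterior distribution for $\theta$, given $X_1,\ldots,X_n$.


Solution The prior PDF is Gaussian:

\[
\pi(\theta) = \frac{1}{\sqrt{2\pi}\,\tau}\,e^{-(\theta-\mu)^2/(2\tau^2)},
\]

and so is the (joint) PDF of $X_1,\ldots,X_n$ for the given value of $\theta$:

\[
f_{X_1,\ldots,X_n}(x_1,\ldots,x_n;\theta) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x_i-\theta)^2/(2\sigma^2)}.
\]

Thus

\[
\pi(\theta)f_{X_1,\ldots,X_n}(x_1,\ldots,x_n;\theta) \propto
\exp\left[-\frac{\theta^2}{2}\left(\frac{1}{\tau^2}+\frac{n}{\sigma^2}\right) + \theta\left(\frac{\mu}{\tau^2}+\frac{n\bar x}{\sigma^2}\right)\right],
\]

leaving out terms not involving $\theta$. Here and below,

\[
\bar x = \frac{1}{n}\sum_{i=1}^{n}x_i .
\]

Then the posterior

\[
\pi(\theta \mid x_1,\ldots,x_n) = \frac{\pi(\theta)f_{X_1,\ldots,X_n}(x_1,\ldots,x_n;\theta)}{\int\pi(\theta')f_{X_1,\ldots,X_n}(x_1,\ldots,x_n;\theta')\,d\theta'}
= \frac{1}{\sqrt{2\pi}\,\tau_n}\exp\left[-\frac{(\theta-\mu_n)^2}{2\tau_n^2}\right],
\]

where

\[
\frac{1}{\tau_n^2} = \frac{1}{\tau^2} + \frac{n}{\sigma^2},\qquad
\mu_n = \frac{\mu/\tau^2 + n\bar x/\sigma^2}{1/\tau^2 + n/\sigma^2}.
\]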
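
The posterior parameters $\mu_n$ and $\tau_n^2$ of Worked Example 2.3.10 are straightforward to compute. A minimal sketch follows; the function name, argument names and the sample values are illustrative assumptions.

```python
def normal_posterior(xs, sigma2, mu, tau2):
    """Posterior mean mu_n and variance tau_n^2 of theta for an N(theta, sigma2) sample
    xs with an N(mu, tau2) prior on theta."""
    n = len(xs)
    xbar = sum(xs) / n
    precision = 1.0 / tau2 + n / sigma2          # 1 / tau_n^2
    mu_n = (mu / tau2 + n * xbar / sigma2) / precision
    return mu_n, 1.0 / precision


# Three observations with sample mean 2, unit variances, prior mean 0:
print(normal_posterior([1.0, 2.0, 3.0], sigma2=1.0, mu=0.0, tau2=1.0))  # (1.5, 0.25)
```

The posterior mean is a weighted average of the prior mean $\mu$ and the sample mean $\bar x$, with weights proportional to the precisions $1/\tau^2$ and $n/\sigma^2$.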


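Worked Example 2.3.11 Let $\Theta, X_1, X_2,\ldots$ be RVs. Suppose that, conditional on $\Theta = \theta$, $X_1, X_2,\ldots$ are independent and $X_k$ is normally distributed with mean $\theta$ and variance $\sigma_k^2$. Suppose that the marginal PDF of $\Theta$ is

\[
\pi(\theta) = \frac{1}{\sqrt{2\pi}}\,e^{-\theta^2/2},\quad \theta \in \mathbb{R}.
\]

Calculate the mean and variance of $\Theta$ conditional on $X_1 = x_1,\ldots,X_n = x_n$.


Solution A direct calculation shows that the conditional PDF

\[
f_{\Theta\mid X_1,\ldots,X_n}(\theta) = f(\theta \mid x_1,\ldots,x_n)
\]

is a multiple of

\[
\exp\left[-\frac{1}{2}\left(1 + \sum_i\frac{1}{\sigma_i^2}\right)\left(\theta - \frac{1}{1+\sum_i 1/\sigma_i^2}\sum_i\frac{x_i}{\sigma_i^2}\right)^2\right],
\]

with a coefficient depending on the values $x_1,\ldots,x_n$ of $X_1,\ldots,X_n$. This implies that the conditional mean is

\[
\mathbb{E}(\Theta \mid X_1,\ldots,X_n) = \frac{1}{1+\sum_i 1/\sigma_i^2}\sum_i\frac{X_i}{\sigma_i^2}
\]

and the conditional variance is

\[
\mathrm{Var}(\Theta \mid X_1,\ldots,X_n) = \frac{1}{1+\sum_i 1/\sigma_i^2},
\]

the latter being independent of $X_1,\ldots,X_n$.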


Worked Example 2.3.12 Let $X$ and $Y$ be IID RVs with the standard normal PDF

\[
f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2},\quad x \in \mathbb{R}.
\]

Find the joint PDF of $U = X + Y$ and $V = X - Y$. Show that $U$ and $V$ are independent and write down the marginal distribution for $U$ and $V$. Let

\[
Z = \begin{cases} |Y|, & \text{if } X > 0, \\ -|Y|, & \text{if } X < 0. \end{cases}
\]

By finding $\mathbb{P}(Z < z)$ for $z < 0$ and $z > 0$, show that $Z$ has a standard normal distribution. Explain briefly why the joint distribution of $X$ and $Z$ is not bivariate normal.


Solution Write $(U,V) = T(X,Y)$. Then for the PDF:

\[
f_{U,V}(u,v) = f_{X,Y}\big(T^{-1}(u,v)\big)\left|\frac{\partial(x,y)}{\partial(u,v)}\right| .
\]

The inverse map is

\[
T^{-1}(u,v) = \left(\frac{u+v}{2},\frac{u-v}{2}\right),\quad\text{with}\quad \left|\frac{\partial(x,y)}{\partial(u,v)}\right| = \frac{1}{2}.
\]

Hence,

\[
f_{U,V}(u,v) = \frac{1}{2\pi}\,e^{-(u+v)^2/8}\,e^{-(u-v)^2/8}\cdot\frac{1}{2} = \frac{1}{4\pi}\,e^{-(u^2+v^2)/4},
\]

i.e. $U, V$ are independent, each with the $\mathrm{N}(0,2)$ distribution. Next, if $z \ge 0$, then

\[
\mathbb{P}(Z < z) = \frac{1}{2}\big(1 + \mathbb{P}(|Y| < z)\big) = \mathbb{P}(Y < z);
\]

and if $z < 0$, then

\[
\mathbb{P}(Z < z) = \frac{1}{2}\,\mathbb{P}(-|Y| < z) = \mathbb{P}(Y > |z|) = \mathbb{P}(Y < z).
\]

So $Z$ has the same standard normal distribution as $Y$. But the joint distribution of $(X,Z)$ gives zero mass to the second and fourth quadrants; hence $Z$ is not independent of $X$.


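Worked Example 2.3.13 Check that the standard normal PDF $\varphi(x) = e^{-x^2/2}/\sqrt{2\pi}$ satisfies the equation

\[
\int_y^\infty x\varphi(x)\,dx = \varphi(y),\quad y > 0 .
\]

By using this equation and $\sin x = \int_0^x\cos y\,dy$, or otherwise, prove that if $X$ is an $\mathrm{N}(0,1)$ random variable, then

\[
(\mathbb{E}\cos X)^2 \le \mathrm{Var}(\sin X) \le \mathbb{E}(\cos X)^2 .
\]

Solution Write

\[
\frac{1}{\sqrt{2\pi}}\int_y^\infty xe^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}}\int_y^\infty e^{-x^2/2}\,d\left(\frac{x^2}{2}\right) = \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}.
\]

Now

\[
\mathbb{E}\sin X = \frac{1}{\sqrt{2\pi}}\int e^{-x^2/2}\sin x\,dx = 0,
\]

as $e^{-x^2/2}\sin x$ is an odd function. Thus,

\[
\mathrm{Var}(\sin X) = \mathbb{E}(\sin X)^2 = \int\varphi(x)(\sin x)^2\,dx = \int\varphi(x)\left(\int_0^x\cos y\,dy\right)^2dx .
\]

Owing to the CS inequality, the last integral is

\[
\begin{aligned}
&\le \int\varphi(x)\,x\int_0^x(\cos y)^2\,dy\,dx \\
&= -\int_{-\infty}^{0}(\cos y)^2\int_{-\infty}^{y}x\varphi(x)\,dx\,dy + \int_0^\infty(\cos y)^2\int_y^\infty x\varphi(x)\,dx\,dy \\
&= \int_{-\infty}^{0}\varphi(y)(\cos y)^2\,dy + \int_0^\infty\varphi(y)(\cos y)^2\,dy \\
&= \int\varphi(y)(\cos y)^2\,dy = \mathbb{E}(\cos X)^2 .
\end{aligned}
\]

On the other hand, as $\mathbb{E}X^2 = 1$,

\[
\begin{aligned}
\mathrm{Var}(\sin X) = \mathbb{E}(\sin X)^2 &= \mathbb{E}X^2\,\mathbb{E}(\sin X)^2 \\
&\ge \big[\mathbb{E}(X\sin X)\big]^2 = \left[\int x\varphi(x)\int_0^x\cos y\,dy\,dx\right]^2 \\
&= \left[-\int_{-\infty}^{0}\cos y\int_{-\infty}^{y}x\varphi(x)\,dx\,dy + \int_0^\infty\cos y\int_y^\infty x\varphi(x)\,dx\,dy\right]^2 \\
&= \left(\int\varphi(y)\cos y\,dy\right)^2 = (\mathbb{E}\cos X)^2 .
\end{aligned}
\]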

Worked Example 2.3.14 State the CLT for IID real RVs with mean pu 
and variance o?. 

Suppose that Xj, X2,... are IID RVs each uniformly distributed over the 
interval [0,1]. Calculate the mean and variance of In X1. 


Suppose that 0 < a < b. Show that 


2 


P((XiX2 ae ae ela: 0) 


tends to a limit and find an expression for it. 


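Solution Let $X_1, X_2, \ldots$ be IID RVs with $\mathbb{E}X_i = \mu$ and $\operatorname{Var} X_i = \sigma^2$. The
CLT states that for all $-\infty \le a < b \le \infty$:

\[ \lim_{n\to\infty} \mathbb{P}\Big( a < \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n}\,\sigma} < b \Big) = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}\,dx. \]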


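Moreover, if $X_1 \sim U[0,1]$, then the mean value $\mathbb{E}(\ln X_1)$ equals

\[ \int_0^1 \ln x\,dx = -\int_\infty^0 y\,d(e^{-y}) = -\big(ye^{-y}\big)\Big|_\infty^0 + \int_\infty^0 e^{-y}\,dy = -1. \]

Similarly, the mean value $\mathbb{E}(\ln X_1)^2$ is equal to

\[ \int_0^1 (\ln x)^2\,dx = \int_\infty^0 y^2\,d(e^{-y}) = \big(y^2e^{-y}\big)\Big|_\infty^0 - 2\int_\infty^0 e^{-y}y\,dy = 2, \]

and

\[ \operatorname{Var}(\ln X_1) = 2 - (-1)^2 = 1. \]

Finally,

\[ \begin{aligned}
 \mathbb{P}\Big( (X_1X_2\cdots X_n)^{n^{-1/2}} e^{n^{1/2}} \in [a,b] \Big)
 &= \mathbb{P}\Big( \frac{1}{\sqrt n}\sum_{i=1}^n \ln X_i + \sqrt n \in [\ln a, \ln b] \Big) \\
 &= \mathbb{P}\Big( \frac{1}{\sqrt n}\sum_{i=1}^n (\ln X_i + 1) \in [\ln a, \ln b] \Big) \\
 &\to \frac{1}{\sqrt{2\pi}}\int_{\ln a}^{\ln b} e^{-x^2/2}\,dx, \quad \text{by the CLT, as } n \to \infty.
\end{aligned} \]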
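A small Monte Carlo experiment can make the limit plausible. This is a sketch under stated assumptions, not part of the original solution: the interval end-points a = 1/e and b = e, the seed, and the sample sizes are arbitrary choices made for illustration, and `log_scaled_product` is a hypothetical helper name.

```python
import math
import random
from statistics import NormalDist

def log_scaled_product(n: int, rng: random.Random) -> float:
    """Return ln[(X_1...X_n)^{1/sqrt(n)} e^{sqrt(n)}] for IID uniform X_i."""
    # 1 - rng.random() lies in (0, 1], which avoids log(0).
    log_sum = sum(math.log(1.0 - rng.random()) for _ in range(n))
    return log_sum / math.sqrt(n) + math.sqrt(n)

rng = random.Random(12345)          # fixed seed, chosen arbitrarily
a, b = math.exp(-1.0), math.exp(1.0)
n, trials = 200, 4000

hits = sum(math.log(a) <= log_scaled_product(n, rng) <= math.log(b)
           for _ in range(trials))
estimate = hits / trials

# Limit from the CLT: Phi(ln b) - Phi(ln a).
std_normal = NormalDist()
limit = std_normal.cdf(math.log(b)) - std_normal.cdf(math.log(a))
print(estimate, limit)
```

Working with the sum of logarithms rather than the product itself avoids floating-point underflow when n is large.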


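Worked Example 2.3.15 The RV $X_i$ is normally distributed with mean
$\mu_i$ and variance $\sigma_i^2$, for i = 1,2, and $X_1$ and $X_2$ are independent. Find the
distribution of $Z = a_1X_1 + a_2X_2$, where $a_1, a_2 \in \mathbb{R}$.

(You may assume that $\mathbb{E}e^{\theta X_i} = \exp(\theta\mu_i + \theta^2\sigma_i^2/2)$.)


Solution The MGF

\[ \phi_{a_1X_1+a_2X_2}(\theta) = \mathbb{E}\,e^{\theta(a_1X_1+a_2X_2)}
 = \mathbb{E}\big(e^{\theta a_1X_1}\big)\,\mathbb{E}\big(e^{\theta a_2X_2}\big) = \phi_{X_1}(a_1\theta)\,\phi_{X_2}(a_2\theta), \]

by independence. Next,

\[ \begin{aligned}
 \phi_{X_1}(a_1\theta)\,\phi_{X_2}(a_2\theta)
 &= \exp\Big(a_1\theta\mu_1 + \frac{a_1^2\theta^2\sigma_1^2}{2} + a_2\theta\mu_2 + \frac{a_2^2\theta^2\sigma_2^2}{2}\Big) \\
 &= \exp\Big(\theta(a_1\mu_1 + a_2\mu_2) + \frac{\theta^2(a_1^2\sigma_1^2 + a_2^2\sigma_2^2)}{2}\Big),
\end{aligned} \]

which is the MGF of the $N(a_1\mu_1 + a_2\mu_2,\ a_1^2\sigma_1^2 + a_2^2\sigma_2^2)$ distribution. In view of the
uniqueness of a PDF with a given MGF,

\[ a_1X_1 + a_2X_2 \sim N\big(a_1\mu_1 + a_2\mu_2,\ a_1^2\sigma_1^2 + a_2^2\sigma_2^2\big). \]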




Worked Example 2.3.16 Let X be a normally distributed RV with
mean 0 and variance 1. Compute $\mathbb{E}X^r$ for r = 0,1,2,3,4. Let Y be a
normally distributed RV with mean $\mu$ and variance $\sigma^2$. Compute $\mathbb{E}Y^r$ for
r = 0,1,2,3,4. State, without proof, what can be said about the sum of two
independent RVs.

The president of Statistica relaxes by fishing in the clear waters of Lake
Tchebyshev. The number of fish that she catches is a Poisson variable with
parameter $\lambda$. The weight of each fish in Lake Tchebyshev is an independent
normally distributed RV with mean $\mu$ and variance $\sigma^2$. (Since $\mu$ is much
larger than $\sigma$, fish of negative weight are rare and much prized by gourmets.)
Let Z be the total weight of her catch. Compute $\mathbb{E}Z$ and $\mathbb{E}Z^2$.

Show, quoting any results you need, that the probability that the president’s
catch weighs less than $\lambda\mu/2$ is less than $4(\mu^2 + \sigma^2)\lambda^{-1}\mu^{-2}$.


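Solution First, $\mathbb{E}X^0 = \mathbb{E}1 = 1$ and $\mathbb{E}X^1 = \mathbb{E}X^3 = 0$, by symmetry. Next,
$\mathbb{E}X^2$ and $\mathbb{E}X^4$ are found by integration by parts:

\[ \begin{aligned}
 \mathbb{E}X^2 &= \frac{1}{\sqrt{2\pi}}\int x^2e^{-x^2/2}\,dx
  = \frac{1}{\sqrt{2\pi}}\Big[\big(-xe^{-x^2/2}\big)\Big|_{-\infty}^{\infty} + \int e^{-x^2/2}\,dx\Big] = 1, \\
 \mathbb{E}X^4 &= \frac{1}{\sqrt{2\pi}}\int x^4e^{-x^2/2}\,dx
  = \frac{1}{\sqrt{2\pi}}\Big[\big(-x^3e^{-x^2/2}\big)\Big|_{-\infty}^{\infty} + 3\int x^2e^{-x^2/2}\,dx\Big] = 3.
\end{aligned} \]

Further,

\[ \begin{aligned}
 \mathbb{E}Y^0 &= 1, \\
 \mathbb{E}Y^1 &= \mathbb{E}(\sigma X + \mu) = \sigma\,\mathbb{E}X + \mu\,\mathbb{E}1 = \mu, \\
 \mathbb{E}Y^2 &= \mathbb{E}\big(\sigma^2X^2 + 2\mu\sigma X + \mu^2\big) = \sigma^2 + \mu^2, \\
 \mathbb{E}Y^3 &= \mathbb{E}\big(\sigma^3X^3 + 3\mu\sigma^2X^2 + 3\mu^2\sigma X + \mu^3\big) = 3\mu\sigma^2 + \mu^3, \\
 \mathbb{E}Y^4 &= \mathbb{E}\big(\sigma^4X^4 + 4\mu\sigma^3X^3 + 6\mu^2\sigma^2X^2 + 4\mu^3\sigma X + \mu^4\big) = 3\sigma^4 + 6\mu^2\sigma^2 + \mu^4.
\end{aligned} \]

Now, if $X_1, X_2$ are independent normal RVs of means $\mu_1, \mu_2$ and variances
$\sigma_1^2, \sigma_2^2$, then $X_1 + X_2$ is $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.

Thus, if $Y_r$ is the weight of r fish, then $Y_r \sim N(r\mu, r\sigma^2)$.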

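Finally,

\[ \mathbb{E}Z = \sum_{r\ge 0} \mathbb{P}(\text{catch } r \text{ fish})\,\mathbb{E}Y_r = \sum_{r\ge 0} \frac{\lambda^re^{-\lambda}}{r!}\,r\mu = \lambda\mu, \]

and similarly

\[ \begin{aligned}
 \mathbb{E}Z^2 &= \sum_{r\ge 0} \frac{\lambda^re^{-\lambda}}{r!}\,\mathbb{E}Y_r^2 = \sum_{r\ge 0} \frac{\lambda^re^{-\lambda}}{r!}\,\big(r\sigma^2 + r^2\mu^2\big) \\
 &= \lambda\sigma^2 + \mu^2\sum_{r\ge 1} \frac{\lambda^re^{-\lambda}}{r!}\,r(r-1) + \mu^2\sum_{r\ge 1} \frac{\lambda^re^{-\lambda}}{r!}\,r = \lambda\sigma^2 + \mu^2(\lambda^2 + \lambda).
\end{aligned} \]

This yields

\[ \operatorname{Var} Z = \mathbb{E}Z^2 - (\mathbb{E}Z)^2 = \lambda(\sigma^2 + \mu^2). \]

Then, by Chebyshev’s inequality,

\[ \mathbb{P}\Big(Z < \frac{\lambda\mu}{2}\Big) \le \mathbb{P}\Big(|Z - \lambda\mu| > \frac{\lambda\mu}{2}\Big) \le \frac{\lambda(\sigma^2 + \mu^2)}{(\lambda\mu/2)^2} = \frac{4(\sigma^2 + \mu^2)}{\lambda\mu^2}. \]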
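The moment formulas can be confirmed by summing the Poisson series directly. The sketch below is an illustrative check only; the parameter values and the truncation point `r_max` are assumptions chosen for the example, and `catch_moments` is a hypothetical helper name.

```python
import math

def catch_moments(lam: float, mu: float, sigma2: float, r_max: int = 200):
    """Sum the Poisson series for EZ and EZ^2, using Y_r ~ N(r*mu, r*sigma2)."""
    ez = 0.0
    ez2 = 0.0
    p = math.exp(-lam)              # P(catch 0 fish)
    for r in range(r_max + 1):
        ez += p * r * mu                              # E Y_r = r*mu
        ez2 += p * (r * sigma2 + (r * mu) ** 2)       # E Y_r^2 = r*sigma2 + r^2*mu^2
        p *= lam / (r + 1)          # P(catch r+1 fish)
    return ez, ez2

lam, mu, sigma2 = 6.0, 2.0, 0.25    # hypothetical parameter values
ez, ez2 = catch_moments(lam, mu, sigma2)
var_z = ez2 - ez ** 2
chebyshev_bound = 4.0 * (mu ** 2 + sigma2) / (lam * mu ** 2)
print(ez, var_z, chebyshev_bound)
```

With these values the series should reproduce $\lambda\mu$ and $\lambda(\sigma^2+\mu^2)$; the Chebyshev bound is informative only when it is below 1, that is, when $\lambda$ is not too small.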


Worked Example 2.3.17 Let X and Y be independent and normally
distributed RVs with the same PDF,

\[ \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}. \]

Find the PDFs of:
(i) $X + Y$;
(ii) $X^2$; and
(iii) $X^2 + Y^2$.
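Solution (i) For the CDF, we have

\[ F_{X+Y}(t) = \frac{1}{2\pi}\iint e^{-(x^2+y^2)/2}\,I(x + y \le t)\,dy\,dx. \]

As PDF $e^{-(x^2+y^2)/2}/2\pi$ is symmetric relative to rotations, the last expression
equals

\[ \frac{1}{2\pi}\iint e^{-(x^2+y^2)/2}\,I\Big(x \le \frac{t}{\sqrt2}\Big)\,dy\,dx
 = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t/\sqrt2} e^{-x^2/2}\,dx
 = \frac{1}{2\sqrt{\pi}}\int_{-\infty}^{t} e^{-u^2/4}\,du, \]

whence the PDF

\[ f_{X+Y}(x) = \frac{1}{2\sqrt{\pi}}\,e^{-x^2/4}. \]

(ii) Similarly, for $t \ge 0$,

\[ \begin{aligned}
 F_{X^2}(t) &= \frac{1}{2\pi}\iint e^{-(x^2+y^2)/2}\,I(x^2 \le t)\,dy\,dx \\
 &= \frac{1}{\sqrt{2\pi}}\int_{-\sqrt t}^{\sqrt t} e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}}\int_0^t e^{-u/2}\,\frac{du}{\sqrt u},
\end{aligned} \]

which yields

\[ f_{X^2}(x) = \frac{1}{\sqrt{2\pi x}}\,e^{-x/2}\,I(x > 0). \]

(iii) Finally,

\[ \begin{aligned}
 F_{X^2+Y^2}(t) &= \frac{1}{2\pi}\iint e^{-(x^2+y^2)/2}\,I(x^2 + y^2 \le t)\,dy\,dx \\
 &= \frac{1}{2\pi}\int_0^\infty\int_0^{2\pi} r e^{-r^2/2}\,I(r^2 \le t)\,d\theta\,dr = \frac12\int_0^t e^{-u/2}\,du,
\end{aligned} \]

and the PDF

\[ f_{X^2+Y^2}(x) = \frac12\,e^{-x/2}\,I(x \ge 0). \]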


Worked Example 2.3.18 The PDF for the t-distribution with q degrees
of freedom is

\[ f(x;q) = \frac{\Gamma((q+1)/2)}{\Gamma(q/2)\sqrt{\pi q}}\Big(1 + \frac{x^2}{q}\Big)^{-(q+1)/2}, \quad -\infty < x < \infty; \]

see equation (3.1.6). Using properties of the exponential function, and the
result that

\[ \Gamma\Big(\frac q2 + b\Big) \sim \sqrt{2\pi}\,\Big(\frac q2\Big)^{b + (q-1)/2}\exp\Big(-\frac q2\Big) \]

as $q \to \infty$, prove that f(x;q) tends to the PDF of an N(0,1) RV in this
limit.

Hint: Write

\[ \Big(1 + \frac{t^2}{q}\Big)^{-(q+1)/2} = \Big[\Big(1 + \frac{t^2}{q}\Big)^{q}\Big]^{-1/2 - 1/(2q)}. \]

Derive the PDF of variable $Y = Z^2$, where Z is N(0,1). The PDF for the
F-distribution with (1, q) degrees of freedom is

\[ g(x;q) = \frac{\Gamma((q+1)/2)}{\Gamma(q/2)\sqrt{\pi q}}\,x^{-1/2}\Big(1 + \frac{x}{q}\Big)^{-(q+1)/2}, \quad 0 < x < \infty. \]

Using the above limiting results, show that g(x;q) tends to the PDF of
Y as $q \to \infty$.


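Solution (Second part only.) The PDF of Y equals $I(x > 0)\,e^{-x/2}/\sqrt{2\pi x}$.
The analysis of the ratio of Gamma functions shows that
$\Gamma((q+1)/2)\big/\big(\Gamma(q/2)\sqrt{\pi q}\big) \to 1/\sqrt{2\pi}$, while

\[ \Big(1 + \frac{x}{q}\Big)^{-(q+1)/2} \to \exp\Big(-\frac x2\Big). \]

Therefore, the PDF for the F-distribution tends to

\[ \frac{1}{\sqrt{2\pi}}\,x^{-1/2}\exp\Big(-\frac x2\Big) = \frac{1}{\sqrt{2\pi x}}\exp\Big(-\frac x2\Big), \]

as required.
This result is natural. In fact, by Example 3.1.4, the $F_{1,q}$-distribution is
related to the ratio

\[ \frac{X_1^2}{\sum_{j=1}^q Y_j^2/q}, \]

where $X_1, Y_1, \ldots, Y_q$ are IID N(0,1). The denominator $\sum_{j=1}^q Y_j^2/q$ tends to
1 as $q \to \infty$ by the LLN.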
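The convergence of the t PDF to the normal PDF can be observed numerically. The following is an illustrative sketch only: the grid of points on [-5, 5] and the chosen values of q are assumptions, and the Gamma ratio is evaluated on the log scale with `math.lgamma` to avoid overflow for large q.

```python
import math

def t_pdf(x: float, q: float) -> float:
    """PDF f(x;q) of the t-distribution with q degrees of freedom."""
    log_c = (math.lgamma((q + 1) / 2) - math.lgamma(q / 2)
             - 0.5 * math.log(math.pi * q))
    return math.exp(log_c - (q + 1) / 2 * math.log1p(x * x / q))

def normal_pdf(x: float) -> float:
    """PDF of N(0,1)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

grid = [k / 10.0 for k in range(-50, 51)]     # points in [-5, 5]
gaps = []
for q in (1, 10, 100, 10000):
    gap = max(abs(t_pdf(x, q) - normal_pdf(x)) for x in grid)
    gaps.append(gap)
    print(q, gap)
```

The largest gap on the grid should shrink roughly like 1/q.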


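Worked Example 2.3.19 Let $X \sim N(\mu, \sigma^2)$ and suppose h(x) is a smooth
bounded function, $x \in \mathbb{R}$. Prove Stein’s formula,

\[ \mathbb{E}[(X - \mu)h(X)] = \sigma^2\,\mathbb{E}[h'(X)]. \]

Solution Without loss of generality we assume $\mu = 0$. Then

\[ \begin{aligned}
 \mathbb{E}[Xh(X)] &= \frac{1}{\sqrt{2\pi\sigma^2}}\int x\,h(x)\,e^{-x^2/2\sigma^2}\,dx \\
 &= \frac{1}{\sqrt{2\pi\sigma^2}}\int h(x)\,d\big(-\sigma^2e^{-x^2/2\sigma^2}\big) \\
 &= \frac{\sigma^2}{\sqrt{2\pi\sigma^2}}\int h'(x)\,e^{-x^2/2\sigma^2}\,dx = \sigma^2\,\mathbb{E}[h'(X)],
\end{aligned} \]

which holds because the integrals converge absolutely.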
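Stein's formula is easy to test numerically for a particular smooth bounded function. The sketch below is an illustration under assumptions that are not in the text: h(x) = tanh x, the parameter values, and the Simpson integration helper `normal_expect` are all choices made for this example.

```python
import math

def normal_expect(g, mu: float, sigma: float,
                  width: float = 12.0, steps: int = 20000) -> float:
    """Approximate E g(X), X ~ N(mu, sigma^2), by Simpson's rule on mu +/- width*sigma."""
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / steps

    def integrand(x: float) -> float:
        z = (x - mu) / sigma
        return g(x) * math.exp(-z * z / 2.0) / (sigma * math.sqrt(2.0 * math.pi))

    total = integrand(lo) + integrand(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return total * h / 3.0

mu, sigma = 0.7, 1.5                            # hypothetical parameters

def h_fun(x: float) -> float:
    return math.tanh(x)                         # smooth and bounded

def h_prime(x: float) -> float:
    return 1.0 / math.cosh(x) ** 2              # derivative of tanh

lhs = normal_expect(lambda x: (x - mu) * h_fun(x), mu, sigma)   # E[(X - mu) h(X)]
rhs = sigma ** 2 * normal_expect(h_prime, mu, sigma)            # sigma^2 E[h'(X)]
print(lhs, rhs)
```

The two sides should agree up to the quadrature error.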
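Worked Example 2.3.20 Let $X \sim N(\mu, \sigma^2)$ and let $\Phi$ be the CDF of
N(0,1). Suppose that h(x) is a smooth bounded function, $x \in \mathbb{R}$. Prove for
any real numbers $\theta, \alpha, \beta$ that the following equations hold:

\[ \mathbb{E}\big[e^{\theta X}h(X)\big] = e^{\mu\theta + \sigma^2\theta^2/2}\,\mathbb{E}\big[h(X + \theta\sigma^2)\big], \]

\[ \mathbb{E}\big[\Phi(\alpha X + \beta)\big] = \Phi\Big(\frac{\alpha\mu + \beta}{\sqrt{1 + \alpha^2\sigma^2}}\Big). \]

Solution Again, assume without loss of generality that $\mu = 0$. Then

\[ \begin{aligned}
 \mathbb{E}\big[e^{\theta X}h(X)\big] &= \frac{1}{\sqrt{2\pi\sigma^2}}\int e^{\theta x - x^2/2\sigma^2}\,h(x)\,dx \\
 &= \frac{1}{\sqrt{2\pi\sigma^2}}\int e^{-(x - \sigma^2\theta)^2/2\sigma^2}\,e^{\sigma^2\theta^2/2}\,h(x)\,dx \\
 &= \frac{e^{\sigma^2\theta^2/2}}{\sqrt{2\pi\sigma^2}}\int e^{-(x - \sigma^2\theta)^2/2\sigma^2}\,h(x)\,dx = e^{\sigma^2\theta^2/2}\,\mathbb{E}\big[h(X + \theta\sigma^2)\big].
\end{aligned} \]

All the integrals here are absolutely converging.

In the proof of the second formula we keep a general value of $\mu$. If $Z \sim N(0,1)$,
independently of X, then

\[ \begin{aligned}
 \mathbb{E}\big[\Phi(\alpha X + \beta)\big] &= \mathbb{P}(Z \le \alpha X + \beta) = \mathbb{P}\big(Z - \alpha(X - \mu) \le \alpha\mu + \beta\big) \\
 &= \mathbb{P}\Big(\frac{Z - \alpha(X - \mu)}{\sqrt{1 + \alpha^2\sigma^2}} \le \frac{\alpha\mu + \beta}{\sqrt{1 + \alpha^2\sigma^2}}\Big) = \Phi\Big(\frac{\alpha\mu + \beta}{\sqrt{1 + \alpha^2\sigma^2}}\Big),
\end{aligned} \]

since $Z - \alpha(X - \mu) \sim N(0, 1 + \alpha^2\sigma^2)$.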


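Worked Example 2.3.21 Suppose that (X,Y) has a jointly normal distribution,
and h(x) is a smooth bounded function. Prove the following relations:

\[ \mathbb{E}\big[(Y - \mathbb{E}Y)\,\big|\,X\big] = \frac{\operatorname{Cov}[X,Y]}{\operatorname{Var}[X]}\,(X - \mathbb{E}X), \]

\[ \operatorname{Cov}[h(X), Y] = \mathbb{E}[h'(X)]\,\operatorname{Cov}[X,Y]. \]

Solution Again, assume that both X and Y have mean zero. The joint PDF
$f_{X,Y}(x,y)$ is

\[ \propto \exp\Big[-\frac12\,(x,y)\,\Sigma^{-1}\begin{pmatrix} x \\ y \end{pmatrix}\Big], \]

where $\Sigma = (\Sigma_{ij})$ is the 2 x 2 covariance matrix. Then, conditional on X = x,
the PDF $f_Y(y|x)$ is

\[ \propto \exp\Big\{-\frac12\Big[\big(\Sigma^{-1}\big)_{22}\,y^2 + 2xy\,\big(\Sigma^{-1}\big)_{12}\Big]\Big\}. \]

Recall that $\Sigma^{-1}$ is the inverse of the covariance matrix $\Sigma$. This indicates
that the conditional PDF $f_Y(y|x)$ is a Gaussian whose mean is linear in x.
That is,

\[ \mathbb{E}(Y|X) = \gamma X. \]

To find $\gamma$, multiply by X and take the expectation. The LHS gives $\mathbb{E}(XY) =
\operatorname{Cov}[X,Y]$ and the RHS gives $\gamma \operatorname{Var} X$. The second equality follows from this
result by Stein’s formula:

\[ \begin{aligned}
 \operatorname{Cov}[h(X), Y] &= \mathbb{E}[h(X)Y] = \mathbb{E}\big[h(X)\,\mathbb{E}(Y|X)\big] \\
 &= \frac{\operatorname{Cov}[X,Y]}{\operatorname{Var}[X]}\,\mathbb{E}[Xh(X)] = \mathbb{E}[h'(X)]\,\operatorname{Cov}[X,Y].
\end{aligned} \]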


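Worked Example 2.3.22 The PDF of Gaussian distribution $N(\theta, \Sigma)$ in
$\mathbb{R}^n$ has the form

\[ f(y) = (2\pi)^{-n/2}(\det\Sigma)^{-1/2}\exp\Big[-\frac12\,(y - \theta)^{\rm T}\Sigma^{-1}(y - \theta)\Big]. \]

In the two-dimensional case we have

\[ f(y_1,y_2) = \big[2\pi\sigma_1\sigma_2(1 - \rho^2)^{1/2}\big]^{-1}
 \exp\Big\{-\frac{1}{2(1 - \rho^2)}\Big[\frac{(y_1 - \theta_1)^2}{\sigma_1^2} - \frac{2\rho(y_1 - \theta_1)(y_2 - \theta_2)}{\sigma_1\sigma_2} + \frac{(y_2 - \theta_2)^2}{\sigma_2^2}\Big]\Big\}, \tag{2.3.10} \]

where

\[ \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \qquad
 \Sigma^{-1} = \frac{1}{1 - \rho^2}\begin{pmatrix} 1/\sigma_1^2 & -\rho/(\sigma_1\sigma_2) \\ -\rho/(\sigma_1\sigma_2) & 1/\sigma_2^2 \end{pmatrix}. \]

(a) Let $Y = (Y_1, Y_2) \sim N(0, \Sigma)$. Check that $Y_1$ and $Y_2$ are independent if
and only if $\rho = 0$. Let $U = Y_1/\sigma_1 + Y_2/\sigma_2$, $V = Y_1/\sigma_1 - Y_2/\sigma_2$. Prove that U and V are
independent.

(b) Check that the marginal distributions of the two-dimensional Gaussian
distribution $N(\theta, \Sigma)$ coincide with $Y_1 \sim N(\theta_1, \sigma_1^2)$, $Y_2 \sim N(\theta_2, \sigma_2^2)$.

(c) Compute the integral

\[ I = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp\big(-(x^2 + 2xy + 2y^2)\big)\,dx\,dy. \]

(d) Let $Y = (Y_1, \ldots, Y_n)^{\rm T}$ be a random vector in $\mathbb{R}^n$ with PDF

\[ f(y) = (2\pi)^{-n/2}\exp\Big(-\frac12\sum_{i=1}^n y_i^2\Big)\Big[1 + \prod_{i=1}^n y_i\exp\big(-y_i^2/2\big)\Big]. \]

Prove that any n - 1 components of this vector are IID N(0,1) RVs.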


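Solution (a) Under condition $\rho = 0$ the PDF factorises: $f(y_1,y_2) =
f(y_1)f(y_2)$. This means independence in the Gaussian case. The joint
distribution of (U,V) is Gaussian. It is easy to check that Cov(U,V) = 0.

(b) Using the formula (2.3.10), we get

\[ \begin{aligned}
 f_{Y_1}(y_1) &= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\exp\Big(-\frac{(y_1 - \theta_1)^2}{2\sigma_1^2}\Big) \\
 &\quad\times \int \exp\Big\{-\frac{1}{2(1 - \rho^2)}\Big[\frac{y_2 - \theta_2}{\sigma_2} - \rho\,\frac{y_1 - \theta_1}{\sigma_1}\Big]^2\Big\}\,dy_2
  = \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\Big(-\frac{(y_1 - \theta_1)^2}{2\sigma_1^2}\Big).
\end{aligned} \]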


(c) Comparing with (2.3.10) we observe that

\[ \Sigma^{-1} = \begin{pmatrix} 2 & 2 \\ 2 & 4 \end{pmatrix}. \]

Next, $\det\Sigma^{-1} = 4$. So, $I = 2\pi/\sqrt4 = \pi$.

(d) By inspection an expression under the product sign is an odd function.
So, it vanishes after integration with respect to any variable $y_i$.


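Worked Example 2.3.23 (Quadratic forms in Gaussian RVs.) Let A be a
symmetric n x n matrix and $X \sim N(0, \Sigma)$. Derive the MGF of the quadratic
form $Q = X^{\rm T}AX$:

\[ \phi(t) = \mathbb{E}e^{tQ} = \big(\det(I - 2tA\Sigma)\big)^{-1/2}. \tag{2.3.11} \]

In fact, for $\phi(t) = \mathbb{E}e^{tX^{\rm T}AX}$ we have the following representations:

\[ \begin{aligned}
 \phi(t) &= (2\pi)^{-n/2}(\det\Sigma)^{-1/2}\int_{\mathbb{R}^n} \exp\Big(-\frac12\,x^{\rm T}\big(\Sigma^{-1} - 2tA\big)x\Big)\,dx \\
 &= (\det\Sigma)^{-1/2}\big(\det(\Sigma^{-1} - 2tA)\big)^{-1/2} = \big(\det(I - 2tA\Sigma)\big)^{-1/2}.
\end{aligned} \]

Next, we compute the mean value and variance of Q:

\[ \mathbb{E}Q = \mathbb{E}[X^{\rm T}AX] = \phi'(0) = -\frac12\big(-2\operatorname{tr}(A\Sigma)\big) = \operatorname{tr}(A\Sigma), \]

where tr stands for the sum of diagonal elements. It is easy to check this
formula directly:

\[ \mathbb{E}[X^{\rm T}AX] = \sum_{i,j=1}^n a_{ij}\sigma_{ij} = \operatorname{tr}(A\Sigma) \]

because A is symmetric.
Next, compute the variance. To this end, set $f(t) = \det(I - 2tB)$ where
$B = A\Sigma$. Then

\[ \phi''(t) = \frac34\,f(t)^{-5/2}\big(f'(t)\big)^2 - \frac12\,f(t)^{-3/2}f''(t). \]

Note that

\[ f(t) = 1 - 2t\operatorname{tr}(B) + 4t^2\Big(\sum_{i<j} b_{ii}b_{jj} - \sum_{i<j} b_{ij}b_{ji}\Big) + O(t^3). \]

Hence,

\[ f''(0) = 8\Big(\sum_{i<j} b_{ii}b_{jj} - \sum_{i<j} b_{ij}b_{ji}\Big) \]

and we find that the variance

\[ \begin{aligned}
 \operatorname{Var}(X^{\rm T}AX) &= -\frac12 f''(0) + 3(\operatorname{tr}[A\Sigma])^2 - (\operatorname{tr}[A\Sigma])^2 \\
 &= 4\Big(\sum_{i<j} b_{ij}b_{ji} - \sum_{i<j} b_{ii}b_{jj}\Big) + 2(\operatorname{tr}[A\Sigma])^2 \\
 &= 2\sum_{i\ne j} b_{ij}b_{ji} + 2\sum_i b_{ii}^2 = 2\operatorname{tr}[B^2] = 2\operatorname{tr}[(A\Sigma)^2].
\end{aligned} \tag{2.3.12} \]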
(a) Using the formula (2.3.11), compute the integral

\[ I_2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x^2 + xy + 3y^2)\exp\big(-(x^2 + 2xy + 2y^2)\big)\,dx\,dy. \]

(b) Let $X \sim N(0, \sigma^2I)$. Using the formula (2.3.11), find the mean value
$\mathbb{E}S$, where

\[ S = (X_1 - X_2)^2 + (X_2 - X_3)^2 + \cdots + (X_{n-1} - X_n)^2. \]

(c) Let $X \sim N(0, \sigma^2I)$. Using the formula (2.3.12), find Var[S].
(d) Let $X \sim N(\theta, \Sigma)$ be a random vector in $\mathbb{R}^n$ and assume that $\det\Sigma \ne 0$.
Prove that

\[ (X - \theta)^{\rm T}\Sigma^{-1}(X - \theta) \sim \chi^2_n. \]

(e) Let $X_1, \ldots, X_n \sim U[0,1]$ be IID RVs. Prove that

\[ -2\ln(X_1X_2\cdots X_n) \sim \chi^2_{2n}. \tag{2.3.13} \]

The distribution $\chi^2_n$ is studied in detail in Example 3.1.2. Formula (2.3.13)
gives a fast method of generating a random variable with distribution $\chi^2_{2n}$.


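Solution (a) Let us apply (2.3.11). In this example

\[ A = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 3 \end{pmatrix}, \qquad
 \Sigma^{-1} = \begin{pmatrix} 2 & 2 \\ 2 & 4 \end{pmatrix}, \qquad
 \Sigma = \begin{pmatrix} 1 & -1/2 \\ -1/2 & 1/2 \end{pmatrix}. \]

So, $I_2 = \pi\operatorname{tr}[A\Sigma] = 2\pi$.

(b) In this example the matrix A looks as follows:

\[ A = \begin{pmatrix}
 1 & -1 & & & \\
 -1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
 & & & -1 & 1
\end{pmatrix}. \]

So, $\mathbb{E}S = (2n - 2)\sigma^2$.
(c) Using (2.3.12), we obtain

\[ \operatorname{Var}[S] = 2\sigma^4\operatorname{tr}[A^2] = 2\sigma^4\sum_{i,j=1}^n a_{ij}^2 = (12n - 16)\sigma^4. \]

Here we use the following fact: for any symmetric matrix A,

\[ \operatorname{tr}[A^2] = \sum_{i,j=1}^n a_{ij}^2. \]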
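The trace computations in parts (b) and (c) can be reproduced mechanically. The sketch below is an illustration with hypothetical helper names (`difference_matrix`, `trace`, `sum_of_squares`); it builds the matrix A of the quadratic form S for a few values of n and evaluates tr A and the sum of squared entries.

```python
def difference_matrix(n: int) -> list:
    """Matrix A with x^T A x = sum over i < n of (x_i - x_{i+1})^2."""
    a = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        # (x_i - x_{i+1})^2 = x_i^2 - 2 x_i x_{i+1} + x_{i+1}^2
        a[i][i] += 1
        a[i + 1][i + 1] += 1
        a[i][i + 1] -= 1
        a[i + 1][i] -= 1
    return a

def trace(m: list) -> int:
    """Sum of the diagonal elements."""
    return sum(m[i][i] for i in range(len(m)))

def sum_of_squares(m: list) -> int:
    """Sum of squared entries; equals tr(A^2) when A is symmetric."""
    return sum(v * v for row in m for v in row)

for n in (2, 5, 10):
    a = difference_matrix(n)
    # Coefficient of sigma^2 in ES, and of sigma^4 in Var[S].
    print(n, trace(a), 2 * sum_of_squares(a))
```

For each n the printed pair should equal (2n - 2, 12n - 16).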
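(d) The MGF of $(X - \theta)^{\rm T}\Sigma^{-1}(X - \theta)$ equals

\[ \big|(1 - 2t)I\big|^{-1/2} = (1 - 2t)^{-n/2}. \]

So it coincides with the MGF of the $\chi^2_n$-distribution.
(e) It is easy to check that $-\ln(X_i) \sim \operatorname{Exp}(1)$. So, the MGF of $\chi^2_2$ has the
form

\[ M(t) = (1 - 2t)^{-1} \]

and coincides with the MGF of the RV 2Y, where $Y \sim \operatorname{Exp}(1)$. Hence each
$-2\ln X_i \sim \chi^2_2$, and the sum of n independent such RVs is $\chi^2_{2n}$.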
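Formula (2.3.13) translates directly into a sampling routine. The sketch below is an illustration under assumptions (the seed, n = 3 and the sample size are arbitrary, and `chi2_even` is a hypothetical name); it draws from $\chi^2_{2n}$ and compares the sample mean and variance with the known values 2n and 4n.

```python
import math
import random

def chi2_even(n: int, rng: random.Random) -> float:
    """One draw from chi-square with 2n degrees of freedom via -2 ln(X_1...X_n)."""
    # Sum the logs instead of multiplying the uniforms, to avoid underflow;
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    return -2.0 * sum(math.log(1.0 - rng.random()) for _ in range(n))

rng = random.Random(2024)           # fixed seed, chosen arbitrarily
n = 3
sample = [chi2_even(n, rng) for _ in range(50000)]

mean = sum(sample) / len(sample)
var = sum((s - mean) ** 2 for s in sample) / (len(sample) - 1)
print(mean, var)                    # compare with 2n = 6 and 4n = 12
```

The method only produces even numbers of degrees of freedom; an odd number would need one extra squared N(0,1) draw.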
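Concluding Part One, we demonstrate that Stirling’s formula may be
derived from Chebyshev’s inequality. Let $X_1, \ldots, X_n \sim \operatorname{Exp}(1)$ be IID RVs.
In this case $\mathbb{E}X_i = \operatorname{Var} X_i = 1$. By Chebyshev’s inequality, for any a > 0 we
have

\[ \mathbb{P}\Big(\frac{1}{\sqrt n}\Big|\sum_{i=1}^n X_i - n\Big| \le a\Big) \ge 1 - \frac{1}{a^2}. \tag{2.3.14} \]

On the other hand, $\sum_{i=1}^n X_i \sim \operatorname{Gam}(n,1)$ with the PDF

\[ f(x) = \frac{1}{(n-1)!}\,x^{n-1}e^{-x}\,I(x > 0). \]

We see that the LHS of (2.3.14) can be estimated as follows:

\[ \begin{aligned}
 \frac{1}{(n-1)!}\int_{-\sqrt n a + n}^{\sqrt n a + n} x^{n-1}e^{-x}\,dx
 &= \frac{n^{n-1/2}e^{-n}}{(n-1)!}\int_{-a}^{a}\Big(1 + \frac{y}{\sqrt n}\Big)^{n-1}e^{-\sqrt n y}\,dy \\
 &= \frac{n^{n-1/2}e^{-n}}{(n-1)!}\int_{-a}^{a}\exp\Big(-\frac{y^2}{2} + \Delta_n(y)\Big)\,dy \le 1.
\end{aligned} \]

Observe that $\Delta_n(y) = (n-1)\ln\big(1 + y/\sqrt n\big) - \sqrt n\,y + y^2/2 \to 0$ as $n \to \infty$
uniformly over y for $-a \le y \le a$.
Hence, the RHS of (2.3.14) satisfies

\[ 1 - \frac{1}{a^2} \le \liminf_{n\to\infty}\frac{n^{n+1/2}e^{-n}}{n!}\int_{-a}^{a} e^{-y^2/2}\,dy
 \le \limsup_{n\to\infty}\frac{n^{n+1/2}e^{-n}}{n!}\int_{-a}^{a} e^{-y^2/2}\,dy \le 1. \]

Letting $a \to \infty$ we have $\int_{-a}^{a} e^{-y^2/2}\,dy \to \sqrt{2\pi}$ and

\[ \lim_{n\to\infty}\frac{\sqrt{2\pi}\,n^{n+1/2}e^{-n}}{n!} = 1. \]

Thus, Stirling’s formula (1.6.22) is proved.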
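The limit can be observed numerically. The sketch below is only an illustration (the function name `stirling_ratio` and the chosen values of n are assumptions); the ratio is evaluated on the log scale so that large n does not overflow.

```python
import math

def stirling_ratio(n: int) -> float:
    """sqrt(2*pi) * n^(n + 1/2) * e^(-n) / n!, computed via logarithms."""
    log_ratio = (0.5 * math.log(2.0 * math.pi) + (n + 0.5) * math.log(n)
                 - n - math.lgamma(n + 1))      # lgamma(n + 1) = ln n!
    return math.exp(log_ratio)

ratios = [stirling_ratio(n) for n in (1, 10, 100, 1000)]
for n, r in zip((1, 10, 100, 1000), ratios):
    print(n, r)
```

The ratio increases towards 1, with a deficit of roughly 1/(12n).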


Problem 2.3.24 How large a random sample should be taken from a 
normal distribution in order for the probability to be at least 0.99 that 
the sample mean will be within one standard deviation of the mean of the 
distribution? 

Hint: ®(2.58) = 0.995. 


Problem 2.3.25 Let $X_1, X_2, \ldots$ be independent Cauchy RVs, each with
PDF

\[ f(x) = \frac{1}{\pi(1 + x^2)}, \quad x \in \mathbb{R}. \]

Show that $(X_1 + X_2 + \cdots + X_n)/n$ has the same distribution as $X_1$.

Does this contradict the weak LLN or the CLT?

Hint: The CHF of $X_1$ is $e^{-|t|}$, and so is the CHF of $(X_1 + X_2 + \cdots + X_n)/n$.
The result follows by the uniqueness of the PDF with a given CHF.

The LLN and the CLT require the existence of the mean value and the 
variance. 


PART TWO


BASIC STATISTICS 


3 


Parameter estimation 


3.1 Preliminaries. Some important probability distributions 


All models are wrong but some are useful. 
G.E.P. Box (1919-2013), American statistician.


Model without a Cause. 
(From the series ‘Movies that never made it to the Big Screen’.) 


In the second half of this volume we discuss the material from the second year 
(Part IB) Statistics. This material will be treated as a natural continuation 
of the IA probability course. Statistics, which is called an ‘applicable’ subject 
by the Faculty of Mathematics of Cambridge University, occupies a place 
somewhere between ‘pure’ and ‘applied’ disciplines in the current Cambridge 
University course landscape. One modern definition is that statistics is a col- 
lection of procedures and principles for gaining and processing information 
in order to make decisions when faced with uncertainty. It is interesting to 
compare this with earlier interpretations of the term ‘statistics’ and related 
terms. Traditionally, the words ‘statistic’ and ‘statistics’ stem from ‘state’, 
meaning a political form of government. In fact, the word ‘statist’ appears
in Hamlet, Act 5, Scene 2 (see footnote 1):


Hamlet: Being thus benetted round with villainies,
Or I could make a prologue to my brains,
They had begun the play. I sat me down,
Devised a new commission, wrote it fair.
I once did hold it, as our statists do,
A baseness to write fair, and laboured much
How to forget that learning; but sir, now
It did me yeoman’s service. Wilt thou know
Th’effect of what I wrote?


1 William Shakespeare. Hamlet, Prince of Denmark (ed. Philip Edwards). Cambridge
University Press, 1985.


and then in Cymbeline, Act 2, Scene 4 (see footnote 2):


Posthumus: I do believe, 
Statist though I am none, nor like to be, 
That this will prove a war; .... 


The meaning of the word ‘statist’ seems to be a person performing a state 
function. (The glossary to The Complete Works by William Shakespeare. 
The Alexander Text (London and Glasgow: Collins, 1990) simply defines 
it as ‘statesman’.) In a similar sense, the same word is used in Milton’s 
Paradise Regained, The Fourth Book, Line 355. 

Many of the definitions of statistics that appeared before 1935 can be 
found in [152]; their meaning is essentially ‘a description of the past or 
present political and financial situation of a given realm’. Characteristically, 
Napoleon described statistics as ‘a budget of things’. 

The definition of statistics remained a popular occupation well after 1935
[104], with a wide variety of opinions expressed by different authors (and
sometimes by a single author over an interval of time). Political and ideolog- 
ical factors added to the confusion: Soviet-era authors concertedly attacked 
Western writers for portraying statistics as a methodological, rather than 
a material, science. The limit of absurdity was to proclaim the existence of 
‘proletarian statistics’, as opposed to ‘bourgeois statistics’. The former was 
helping in the ‘struggle of the working class against its exploitators’, while 
the latter was ‘a servant of the monopolistic capital’. 

A particularly divisive issue became the place and role of mathematical 
statistics. For example, G.E.P. Box (1919-2013), a British-born American 
statistician who began his career as a chemistry student and then served 
as a practising statistician in the British Army during World War II, wrote 
that it was a ‘mistake to invent the term “mathematical statistics”. This 
grave blunder has led to a great number of difficulties’. 

It is interesting to compare this with two rather different sentences by
J.W. Tukey (1915-2000), one of the greatest figures of all time in statistics
and many areas of applied mathematics, credited, among many other things,
with the invention of the terms ‘bit’ (short for binary digit) and ‘software’.
Tukey, who trained as a pure mathematician (his Ph.D. was in topology),
said: ‘Statistics is a part of a perplexed and confusing network connecting
mathematics, scientific philosophy and other branches of science, including
experimental sampling, with what one does in analysing and sometimes
in collecting data.’ On the other hand, Tukey also expressed the opinion:
‘Statistics is a part of applied mathematics which deals with (although not
exclusively) stochastic processes’. The latter point of view was endorsed in
a substantial number of universities (Cambridge included) where many of
the members of Statistics Departments or units (including holders of Chairs
of Mathematical Statistics) are in fact specialists in stochastic processes.


2 William Shakespeare, Cymbeline (ed. John Dover Wilson). Cambridge University Press, 1969.

In the statistics part of this book one learns various ways to process ob- 
served data and draw inferences from it: point estimation, interval estima- 
tion, hypothesis testing and regression modelling. Some of the methods are 
based on a clear logical foundation, but some appear ad hoc and are adopted 
simply because they provide answers to (important) practical questions. 

It may appear that, after decades of painstaking effort (especially in the 
1940s-1970s), attempts to provide a unified rigorous foundation for modern 
statistics have nowadays been all but abandoned by the majority of the 
academic community. (This is perhaps an overstatement, but it is how it 
often seems to non-specialists.) However, such an authority as Rao (of the 
Rao-Blackwell Theorem and the Cramér—Rao inequality; see below) stresses 
that ties between statistics and mathematics have only become stronger and 
more diverse. 

On the other hand, during the past 30 years there has been a spectacular 
proliferation of statistical methods in literally every area of scientific analy- 
sis, the main justification of their usefulness being that these methods work 
and work successfully. The advent of modern computational techniques (in- 
cluding the packages SPSS, MINITAB and SPLUS) has made it possible to 
analyse huge arrays of data and display results in accessible forms. One can 
say that computers have freed statisticians from the grip of mathematical 
tractability [149]. 

It has to be stressed that even (or perhaps especially) at the level of an 
initial statistics course, accurate manual calculations are extremely impor- 
tant for successful examination performance, and candidates are advised to 
pay serious attention to their numerical work. 

The prerequisite for IB Statistics includes brushing up on knowledge of
some key facts from IA Probability. This includes basic concepts: probability
distributions, PDFs, RVs, expectation, variance, joint distributions, covariance,
independence. It is convenient to speak of a probability mass function
(PMF) in the case of discrete RVs and a PDF in the case of continuous
ones. Traditionally, statistical courses begin with studying some important
families of PMFs/PDFs depending on a parameter (or several parameters
forming a vector). For instance, Poisson PMFs, Po($\lambda$), are parametrised by
$\lambda > 0$, and so are exponential PDFs, Exp($\lambda$). Normal PDFs are parametrised
by pairs $(\mu, \sigma^2)$, where $\mu \in \mathbb{R}$ is the mean and $\sigma^2 > 0$ the variance. The ‘true’
value of a parameter (or several parameters) is considered unknown and we
will have to develop the means to make a judgement about what it is.

A significant part of the course is concerned with IID N(0,1) RVs
$X_1, X_2, \ldots$ and their functions. The simplest functions are linear
combinations $\sum_{i=1}^n a_iX_i$.


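Example 3.1.1 Linear combinations of independent normal RVs. We have
already discussed linearity properties of normal RVs in Section 2.3; here we
recall them with minor modifications. Suppose that $X_1, \ldots, X_n$ are independent,
and $X_i \sim N(\mu_i, \sigma_i^2)$. Their joint PDF is

\[ f_X(x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\Big[-\frac12\,(x_i - \mu_i)^2/\sigma_i^2\Big], \quad
 x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n. \tag{3.1.1} \]

Then for all real $a_1, \ldots, a_n$,

\[ \sum_i a_iX_i \sim N\Big(\sum_i a_i\mu_i,\ \sum_i a_i^2\sigma_i^2\Big). \tag{3.1.2} \]

In particular, if $a_1 = \cdots = a_n = 1/n$, $\mu_1 = \cdots = \mu_n = \mu$ and
$\sigma_1 = \cdots = \sigma_n = \sigma$, then

\[ \frac1n\sum_{i=1}^n X_i \sim N\Big(\mu, \frac{\sigma^2}{n}\Big). \tag{3.1.3} \]

On the other hand,

\[ \sum_{i=1}^n (X_i - \mu_i)\Big/\Big(\sum_{i=1}^n \sigma_i^2\Big)^{1/2} \sim N(0,1). \]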

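Next, suppose $A = (A_{ij})$ is a real invertible n x n matrix, with $\det A \ne 0$,
and the inverse matrix $A^{-1}$. Write

\[ X = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \]

and consider the mutually inverse linear transformations $Y = A^{\rm T}X$ and
$X = (A^{-1})^{\rm T}Y$, with

\[ Y_j = (A^{\rm T}X)_j = \sum_i A_{ij}X_i, \qquad X_i = \big((A^{-1})^{\rm T}Y\big)_i = \sum_j (A^{-1})_{ji}Y_j. \]

Then the RVs $Y_1, \ldots, Y_n$ are jointly normal. More precisely, the joint PDF
$f_Y(y)$ is calculated as

\[ \begin{aligned}
 f_Y(y) &= \frac{1}{|\det A|}\,f_X\big[(A^{-1})^{\rm T}y\big] \\
 &= \frac{1}{|\det A|}\prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\Big[-\frac{1}{2\sigma_i^2}\Big(\sum_j (A^{-1})_{ji}\,y_j - \mu_i\Big)^2\Big] \\
 &= \frac{1}{(2\pi)^{n/2}\big[\det(A^{\rm T}\Sigma A)\big]^{1/2}}
  \exp\Big[-\frac12\Big\langle (y - A^{\rm T}\mu),\ (A^{\rm T}\Sigma A)^{-1}(y - A^{\rm T}\mu)\Big\rangle\Big].
\end{aligned} \]

Here, as before, $\langle\,,\rangle$ stands for the scalar product in $\mathbb{R}^n$, and

\[ \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix}, \quad
 y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n \quad\text{and}\quad
 \Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \ldots & 0 \\ 0 & \sigma_2^2 & \ldots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_n^2 \end{pmatrix}. \]

Recall, $\mu$ and $\Sigma$ are the mean vector and the covariance matrix of X.

We recognise that the mean vector of Y is $A^{\rm T}\mu$ and the covariance matrix
is $A^{\rm T}\Sigma A$:

\[ \mathbb{E}Y_j = \sum_{i=1}^n A_{ij}\mu_i, \qquad \operatorname{Cov}(Y_i, Y_j) = \sum_{k=1}^n A_{ki}\,\sigma_k^2\,A_{kj}. \]

Now suppose A is a real orthogonal n x n matrix, with $\sum_k A_{ki}A_{kj} = \delta_{ij}$,
i.e. $A^{\rm T}A$ equal to the unit n x n matrix. Then $\det A = \pm1$. Assume that the
above RVs $X_i$ have the same variance: $\sigma_1^2 = \cdots = \sigma_n^2 = \sigma^2$. Then

\[ \begin{aligned}
 \operatorname{Cov}(Y_i, Y_j) &= \operatorname{Cov}\big((A^{\rm T}X)_i, (A^{\rm T}X)_j\big) = \sum_{k,l=1}^n A_{ki}A_{lj}\operatorname{Cov}(X_k, X_l) \\
 &= \sigma^2\sum_{k,l} A_{ki}A_{lj}\,\delta_{kl} = \sigma^2\sum_k A_{ki}A_{kj} = \sigma^2\delta_{ij}.
\end{aligned} \]

That is, random vector $X^{\rm T}A = Y^{\rm T}$ has independent components $Y_1, \ldots, Y_n$,
with $Y_j \sim N\big((A^{\rm T}\mu)_j, \sigma^2\big)$.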
Example 3.1.2 Sums of squares: the $\chi^2$ distribution. Another example
repeatedly used in what follows is the sum of squares. Let $X_1, X_2, \ldots$ be
IID N(0,1) RVs. The distribution of the sum

\[ \sum_{i=1}^n X_i^2 \]

is called the chi-square, or $\chi^2$ distribution, with n degrees of freedom, or
shortly the $\chi^2_n$ distribution. As we will check below, it has the PDF $f_{\chi^2_n}$
concentrated on the positive half-axis $(0, \infty)$:

\[ f_{\chi^2_n}(x) \propto x^{n/2-1}e^{-x/2}\,I(x > 0), \]

with the constant of proportionality $1/\big[\Gamma(n/2)\,2^{n/2}\big]$. Here

\[ \Gamma(n/2) = \int_0^\infty \frac{1}{2^{n/2}}\,x^{n/2-1}e^{-x/2}\,dx. \]

One can recognise the $\chi^2_n$ distribution as $\operatorname{Gam}(\alpha, \lambda)$ with $\alpha = n/2$, $\lambda =
1/2$. On the other hand, if $X_1, \ldots, X_n$ are IID $N(\mu, \sigma^2)$, then

\[ \sum_{i=1}^n (X_i - \mu)^2 \sim \operatorname{Gam}\Big(\frac n2, \frac{1}{2\sigma^2}\Big) \quad\text{and}\quad \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 \sim \chi^2_n. \tag{3.1.4} \]

The mean value of the $\chi^2_n$ distribution equals n and the variance 2n. All $\chi^2$
PDFs are unimodal. A sample of graphs of PDF $f_{\chi^2_n}$ is shown in Figure 3.1.
A useful property of the family of $\chi^2$ distributions is that it is closed under
independent summation. That is if $Z \sim \chi^2_n$ and $Z' \sim \chi^2_{n'}$, independently,
then $Z + Z' \sim \chi^2_{n+n'}$. Of course, $\chi^2$ distributions inherit this property from
Gamma distributions.
A quick way to check that

\[ f_{\chi^2_n}(x) = \frac{1}{\Gamma(n/2)\,2^{n/2}}\,x^{n/2-1}e^{-x/2}\,I(x > 0) \tag{3.1.5} \]

is to use the MGF or CHF. The MGF $M_{X_i^2}(\theta) = \mathbb{E}e^{\theta X_i^2}$ equals

\[ \begin{aligned}
 \frac{1}{\sqrt{2\pi}}\int e^{\theta x^2}e^{-x^2/2}\,dx &= \frac{1}{\sqrt{2\pi}}\int e^{-(1-2\theta)x^2/2}\,dx \\
 &= \frac{1}{\sqrt{1 - 2\theta}}\,\frac{1}{\sqrt{2\pi}}\int e^{-y^2/2}\,dy = \frac{1}{\sqrt{1 - 2\theta}}, \quad \theta < \frac12,
\end{aligned} \]

which is the MGF of Gam(1/2, 1/2). Next, the MGF $M_{Y_n}(\theta)$ of $Y_n =
\sum_{i=1}^n X_i^2$ is the power $\big(M_{X_i^2}(\theta)\big)^n = (1 - 2\theta)^{-n/2}$. This is the MGF of the
Gam(n/2, 1/2) distribution. Hence $Y_n \sim \operatorname{Gam}(n/2, 1/2)$, as claimed.


Figure 3.1 The chi-square PDFs.
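The MGF computation in Example 3.1.2 can be checked by numerical integration. The sketch below is an illustration only; the values of $\theta$ and n, the truncation width and the helper name `mgf_of_square` are assumptions made for the example.

```python
import math

def mgf_of_square(theta: float, width: float = 40.0, steps: int = 40000) -> float:
    """E exp(theta * X^2) for X ~ N(0,1), theta < 1/2, by Simpson's rule on [-width, width]."""
    def integrand(x: float) -> float:
        # e^{theta x^2} times the N(0,1) PDF, combined into a single exponent
        return math.exp(-(1.0 - 2.0 * theta) * x * x / 2.0) / math.sqrt(2.0 * math.pi)

    h = 2.0 * width / steps
    total = integrand(-width) + integrand(width)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(-width + k * h)
    return total * h / 3.0

theta, n = 0.2, 4                               # hypothetical values, theta < 1/2
single = mgf_of_square(theta)                   # numerical MGF of X_1^2
closed_form = (1.0 - 2.0 * theta) ** -0.5       # (1 - 2 theta)^{-1/2}
chi2_mgf = single ** n                          # MGF of the sum of n independent squares
print(single, closed_form, chi2_mgf)
```

The n-th power should match $(1 - 2\theta)^{-n/2}$, the MGF of Gam(n/2, 1/2).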


Example 3.1.3 The Student t-distribution. If, as above, $X_1, X_2, \ldots$ are
IID N(0,1) RVs, then the distribution of the ratio

\[ \frac{X_{n+1}}{\Big(\sum_{i=1}^n X_i^2/n\Big)^{1/2}} \]

is called the Student distribution, with n degrees of freedom, or the $t_n$
distribution for short. It has the PDF $f_{t_n}$ spread over the whole axis $\mathbb{R}$ and is
symmetric (even) with respect to the inversion $x \mapsto -x$:

\[ f_{t_n}(t) \propto \Big(1 + \frac{t^2}{n}\Big)^{-(n+1)/2}, \tag{3.1.6} \]

with the proportionality constant

\[ \frac{1}{\sqrt{\pi n}}\,\frac{\Gamma((n+1)/2)}{\Gamma(n/2)}. \]

For n > 1 it has, obviously, the mean value 0. For n > 2, the variance is
n/(n-2). All Student PDFs are unimodal. A sample of graphs of PDF $f_{t_n}$ is
shown in Figure 3.2. These PDFs resemble normal PDFs (and, as explained
in Worked Example 2.3.18, $f_{t_n}(t)$ approaches $e^{-t^2/2}/\sqrt{2\pi}$ as $n \to \infty$). However,
for finite n, the ‘tails’ of $f_{t_n}$ are ‘thicker’ than those of the normal
PDF. In particular, the MGF of a t distribution does not exist (except at
$\theta = 0$): if $X \sim t_n$, then $\mathbb{E}e^{\theta X} = \infty$ for all $\theta \in \mathbb{R}\setminus\{0\}$.

Figure 3.2 The Student PDFs.


Note that for n = 1, the $t_1$ distribution coincides with the Cauchy
distribution.

To derive the formula for the PDF $f_{t_n}$ of the $t_n$ distribution, observe that
it coincides with the PDF of the ratio $T = X/\sqrt{Y/n}$, where RVs X and Y
are independent, and $X \sim N(0,1)$, $Y \sim \chi^2_n$. Then

\[ f_{X,Y}(x,y) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,\frac{2^{-n/2}}{\Gamma(n/2)}\,y^{n/2-1}e^{-y/2}\,I(y > 0). \]

The Jacobian $\partial(t,u)/\partial(x,y)$ of the change of variables

\[ t = \frac{x}{(y/n)^{1/2}}, \quad u = y \]

equals $(n/y)^{1/2}$ and the inverse Jacobian $\partial(x,y)/\partial(t,u) = (u/n)^{1/2}$. Then

\[ \begin{aligned}
 f_{t_n}(t) = f_T(t) &= \int_0^\infty f_{X,Y}\big(t(u/n)^{1/2}, u\big)\Big(\frac un\Big)^{1/2}\,du \\
 &= \frac{1}{\sqrt{2\pi}}\,\frac{2^{-n/2}}{\Gamma(n/2)}\int_0^\infty e^{-t^2u/2n}\,u^{n/2-1}e^{-u/2}\Big(\frac un\Big)^{1/2}\,du \\
 &= \frac{1}{\sqrt{2\pi n}}\,\frac{2^{-n/2}}{\Gamma(n/2)}\int_0^\infty e^{-(1 + t^2/n)u/2}\,u^{(n+1)/2-1}\,du.
\end{aligned} \]

The integrand in the last expression comes from the PDF of the distribution
$\operatorname{Gam}\big((n+1)/2,\ 1/2 + t^2/(2n)\big)$. Hence,

\[ f_{t_n}(t) = \frac{1}{\sqrt{\pi n}}\,\frac{\Gamma((n+1)/2)}{\Gamma(n/2)}\,\frac{1}{(1 + t^2/n)^{(n+1)/2}}, \]

which gives the above formula.
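The derivation can be cross-checked by integrating the joint PDF numerically and comparing with the closed form (3.1.6). The sketch below is illustrative only: n = 5, the test points, the truncation of the u-integral at 200 and the helper names are assumptions made for this example.

```python
import math

def t_pdf_closed(t: float, n: int) -> float:
    """Closed form (3.1.6) with its normalising constant."""
    c = math.gamma((n + 1) / 2) / (math.gamma(n / 2) * math.sqrt(math.pi * n))
    return c * (1.0 + t * t / n) ** (-(n + 1) / 2)

def t_pdf_by_integration(t: float, n: int,
                         upper: float = 200.0, steps: int = 40000) -> float:
    """Simpson's rule for the integral of f_{X,Y}(t*sqrt(u/n), u) * sqrt(u/n) over [0, upper]."""
    const = 2.0 ** (-n / 2) / (math.gamma(n / 2) * math.sqrt(2.0 * math.pi * n))

    def integrand(u: float) -> float:
        normal_part = math.exp(-t * t * u / (2.0 * n))   # N(0,1) PDF at x = t*sqrt(u/n), up to 1/sqrt(2*pi)
        # u^{n/2 - 1} from the chi-square PDF times u^{1/2} from the Jacobian
        return const * normal_part * u ** ((n - 1) / 2) * math.exp(-u / 2.0)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0

n = 5
errors = [abs(t_pdf_by_integration(t, n) - t_pdf_closed(t, n))
          for t in (0.0, 0.5, 1.3, 3.0)]
print(errors)
```

An odd n is used here so that the integrand is a polynomial times an exponential near u = 0, which keeps Simpson's rule accurate.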
Example 3.1.4 The Fisher F-distribution. Now let $X_1, \ldots, X_m$ and
$Y_1, \ldots, Y_n$ be IID N(0,1) RVs. The ratio

\[ \frac{\sum_{i=1}^m X_i^2/m}{\sum_{j=1}^n Y_j^2/n} \]

has the distribution called the Fisher, or F-distribution with parameters
(degrees of freedom) m, n, or the $F_{m,n}$ distribution for short. The corresponding
PDF $f_{F_{m,n}}$ is concentrated on the positive half-axis:

\[ f_{F_{m,n}}(x) \propto x^{m/2-1}\Big(1 + \frac mn\,x\Big)^{-(m+n)/2} I(x > 0), \tag{3.1.7} \]

with the proportionality coefficient

\[ \frac{\Gamma((m+n)/2)}{\Gamma(m/2)\,\Gamma(n/2)}\Big(\frac mn\Big)^{m/2}. \]

The F distribution has the mean value n/(n-2) (for n > 2, independently
of m) and variance

\[ \frac{2n^2(m + n - 2)}{m(n-2)^2(n-4)} \quad (\text{for } n > 4). \]

Observe that

if $Z \sim t_n$, then $Z^2 \sim F_{1,n}$, and if $Z \sim F_{m,n}$, then $Z^{-1} \sim F_{n,m}$.

A sample of graphs of PDF $f_{F_{m,n}}$ is plotted in Figure 3.3.
The Fisher distribution is often called the Snedecor-Fisher distribution.
The above formula for the PDF $f_{F_{m,n}}$ can be verified similarly to that for
$f_{t_n}$; we omit the corresponding calculation.


Figure 3.3 The Fisher PDFs.

Figure 3.4 The Beta PDFs.


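Example 3.1.5 The Beta distribution. The Beta distribution is a probability
distribution on (0,1) with the PDF

\[ f(x) \propto x^{\alpha-1}(1 - x)^{\beta-1}\,I(0 < x < 1), \tag{3.1.8} \]

where $\alpha, \beta > 0$ are parameters. The proportionality constant equals

\[ \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} = \frac{1}{B(\alpha, \beta)}, \]

where $B(\alpha, \beta)$ is the Beta function. We write $X \sim \operatorname{Bet}(\alpha, \beta)$ if RV X has
the PDF as above. A Beta distribution is used to describe various random
fractions. It has

\[ \mathbb{E}X = \frac{\alpha}{\alpha + \beta}, \qquad \operatorname{Var} X = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}. \]

Beta PDF plots are shown in Figure 3.4.
It is interesting to note that

\[ \text{if } X \sim F_{m,n} \text{ then } \frac{(m/n)X}{1 + (m/n)X} = \frac{mX}{n + mX} \sim \operatorname{Bet}\Big(\frac m2, \frac n2\Big). \]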
For further examples we refer the reader to the tables of probability
distributions in the Appendix.
It will be important to work with quantiles of these (and other) distributions.
Given $\gamma \in (0,1)$, the upper $\gamma$-quantile, or upper $\gamma$-point, $a_+(\gamma)$, of a
PMF/PDF f is determined from the equation

\[ \sum_{x \ge a_+(\gamma)} f(x) = \gamma, \quad\text{or}\quad \int_{a_+(\gamma)}^{\infty} f(x)\,dx = \gamma. \]

Similarly, the lower $\gamma$-quantile (lower $\gamma$-point) $a_-(\gamma)$ is determined from the
equation

\[ \sum_{x \le a_-(\gamma)} f(x) = \gamma, \quad\text{or}\quad \int_{-\infty}^{a_-(\gamma)} f(x)\,dx = \gamma. \]

Of course in the case of a PDF

\[ a_-(\gamma) = a_+(1 - \gamma), \quad 0 < \gamma < 1. \tag{3.1.9} \]

In the case of a PMF, equation (3.1.9) should be modified, taking into account
whether the value $a_-(\gamma)$ is attained or not (i.e. whether $f(a_-(\gamma)) > 0$
or $f(a_-(\gamma)) = 0$).

If we measure $\gamma$ as a percentage, we speak of percentiles of a given
distribution. Quantiles and percentiles of a normal, of a $\chi^2$, of a t- and of an
F-distribution can be found in standard statistical tables. Modern packages
allow one to calculate them with a high accuracy for practically any given
distribution.

Some basic lower percentiles are given in Tables 3.1 and 3.2 below (cour- 
tesy of R. Weber). For points of the normal distribution, see also Table 1.1 
in Section 1.6. 

These tables give values of x such that a certain percentage of the distribution
lies below x. For example, if $X \sim t_3$, then $\mathbb{P}(X \le 5.84) = 0.995$,
and $\mathbb{P}(-5.84 \le X \le 5.84) = 0.99$. If $X \sim F_{8,5}$, then $\mathbb{P}(X \le 4.82) = 0.95$.
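Table entries of this kind can be recomputed from the PDF alone. The sketch below is an illustration, not the method used to build the tables: it integrates the $t_n$ PDF with Simpson's rule, finds quantiles by bisection, and uses relation (3.1.9) for the upper point. The function names, the bracketing interval [-100, 100] and the step counts are assumptions; the normal quantile comes from Python's standard `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def t_pdf(x: float, n: int) -> float:
    """PDF of the Student t_n distribution, evaluated on the log scale."""
    log_c = (math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
             - 0.5 * math.log(math.pi * n))
    return math.exp(log_c - (n + 1) / 2 * math.log1p(x * x / n))

def t_cdf(x: float, n: int, steps: int = 4000) -> float:
    """CDF of t_n: 1/2 plus Simpson's rule on [0, x]; symmetry handles x < 0."""
    if x < 0:
        return 1.0 - t_cdf(-x, n, steps)
    h = x / steps
    total = t_pdf(0.0, n) + t_pdf(x, n)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, n)
    return 0.5 + total * h / 3.0

def lower_point(gamma: float, n: int, lo: float = -100.0, hi: float = 100.0) -> float:
    """Lower gamma-point a_-(gamma) of t_n, found by bisection on the CDF."""
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2.0
        if t_cdf(mid, n) < gamma:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def upper_point(gamma: float, n: int) -> float:
    """Upper gamma-point via a_+(gamma) = a_-(1 - gamma), cf. (3.1.9)."""
    return lower_point(1.0 - gamma, n)

t3_lower = lower_point(0.995, 3)        # compare with the t_3 entry 5.84
t3_upper = upper_point(0.005, 3)        # the same point, by (3.1.9)
z_975 = NormalDist().inv_cdf(0.975)     # compare with the N(0,1) entry 1.96
print(round(t3_lower, 2), round(t3_upper, 2), round(z_975, 2))
```

The bracketing interval is wide enough for the entries with n of at least 2; very heavy-tailed cases such as extreme points of $t_1$ would need a wider bracket and a finer grid.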

The tables can be used to conduct various hypothesis tests with sizes 
0.01, 0.05 and 0.10. For the F distribution, only the 95 percent point is 
shown; this is what is needed to conduct a one-sided test of size 0.05. Tables 
for other percentage points can be found in any statistics book or can be 
calculated using computer software. (Tables 3.1 and 3.2 were constructed 
using functions available in Microsoft Excel.) 

Note that the percentage points for $t_n$ tend to those for N(0,1) as
$n \to \infty$.

% points of $t_n$

   n    0.995    0.99   0.975    0.95
   1    63.66   31.82   12.71    6.31
   2     9.92    6.96    4.30    2.92
   3     5.84    4.54    3.18    2.35
   4     4.60    3.75    2.78    2.13
   5     4.03    3.36    2.57    2.02
   6     3.71    3.14    2.45    1.94
   7     3.50    3.00    2.36    1.89
   8     3.36    2.90    2.31    1.86
   9     3.25    2.82    2.26    1.83
  10     3.17    2.76    2.23    1.81
  11     3.11    2.72    2.20    1.80
  12     3.05    2.68    2.18    1.78
  13     3.01    2.65    2.16    1.77
  14     2.98    2.62    2.14    1.76
  15     2.95    2.60    2.13    1.75
  16     2.92    2.58    2.12    1.75
  17     2.90    2.57    2.11    1.74
  18     2.88    2.55    2.10    1.73
  19     2.86    2.54    2.09    1.73
  20     2.85    2.53    2.09    1.72
  21     2.83    2.52    2.08    1.72
  22     2.82    2.51    2.07    1.72
  23     2.81    2.50    2.07    1.71
  24     2.80    2.49    2.06    1.71
  25     2.79    2.49    2.06    1.71
  26     2.78    2.48    2.06    1.71
  27     2.77    2.47    2.05    1.70
  28     2.76    2.47    2.05    1.70
  29     2.76    2.46    2.05    1.70
  30     2.75    2.46    2.04    1.70
  40     2.70    2.42    2.02    1.68
  60     2.66    2.39    2.00    1.67
 120     2.62    2.36    1.98    1.66

% points of N(0,1)

        0.995    0.99   0.975    0.95    0.90
         2.58    2.33    1.96   1.645   1.282

% points of $\chi^2_n$

   n     0.99   0.975    0.95     0.9
   1     6.63    5.02    3.84    2.71
   2     9.21    7.38    5.99    4.61
   3    11.34    9.35    7.81    6.25
   4    13.28   11.14    9.49    7.78
   5    15.09   12.83   11.07    9.24
   6    16.81   14.45   12.59   10.64
   7    18.48   16.01   14.07   12.02
   8    20.09   17.53   15.51   13.36
   9    21.67   19.02   16.92   14.68
  10    23.21   20.48   18.31   15.99
  11    24.73   21.92   19.68   17.28
  12    26.22   23.34   21.03   18.55
  13    27.69   24.74   22.36   19.81
  14    29.14   26.12   23.68   21.06
  15    30.58   27.49   25.00   22.31
  16    32.00   28.85   26.30   23.54
  17    33.41   30.19   27.59   24.77
  18    34.81   31.53   28.87   25.99
  19    36.19   32.85   30.14   27.20
  20    37.57   34.17   31.41   28.41
  30    50.89   46.98   43.77   40.26
  40    63.69   59.34   55.76   51.81
  50    76.15   71.42   67.50   63.17
  60    88.38   83.30   79.08   74.40
  70    100.4   95.02   90.53   85.53
  80    112.3   106.6   101.8   96.58
  90    124.1   118.1   113.1   107.5
 100    135.8   129.5   124.3   118.5

Table 3.1


3.2 Estimators. Unbiasedness 


License to Sample. 

You Only Estimate Twice. 

The Estimator. 

(From the series ‘Movies that never made it to the Big Screen’.) 


We begin this section with the concepts of unbiasedness and sufficiency. 
The main model in Chapters 3 and 4 is one in which we observe a sample 


95% points of $F_{n_1,n_2}$ (columns: $n_1$; rows: $n_2$)

  n2\n1  1      2      3      4      5      6      8      12     16     20     30     40     50
  1      161.4  199.5  215.7  224.5  230.1  233.9  238.8  243.9  246.4  248.0  250.1  251.1  251.7
  2      18.51  19.00  19.16  19.25  19.30  19.33  19.37  19.41  19.43  19.45  19.46  19.47  19.48
  3      10.13  9.55   9.28   9.12   9.01   8.94   8.85   8.74   8.69   8.66   8.62   8.59   8.58
  4      7.71   6.94   6.59   6.39   6.26   6.16   6.04   5.91   5.84   5.80   5.75   5.72   5.70
  5      6.61   5.79   5.41   5.19   5.05   4.95   4.82   4.68   4.60   4.56   4.50   4.46   4.44
  6      5.99   5.14   4.76   4.53   4.39   4.28   4.15   4.00   3.92   3.87   3.81   3.77   3.75
  7      5.59   4.74   4.35   4.12   3.97   3.87   3.73   3.57   3.49   3.44   3.38   3.34   3.32
  8      5.32   4.46   4.07   3.84   3.69   3.58   3.44   3.28   3.20   3.15   3.08   3.04   3.02
  9      5.12   4.26   3.86   3.63   3.48   3.37   3.23   3.07   2.99   2.94   2.86   2.83   2.80
  10     4.96   4.10   3.71   3.48   3.33   3.22   3.07   2.91   2.83   2.77   2.70   2.66   2.64
  11     4.84   3.98   3.59   3.36   3.20   3.09   2.95   2.79   2.70   2.65   2.57   2.53   2.51
  12     4.75   3.89   3.49   3.26   3.11   3.00   2.85   2.69   2.60   2.54   2.47   2.43   2.40
  13     4.67   3.81   3.41   3.18   3.03   2.92   2.77   2.60   2.51   2.46   2.38   2.34   2.31
  14     4.60   3.74   3.34   3.11   2.96   2.85   2.70   2.53   2.44   2.39   2.31   2.27   2.24
  15     4.54   3.68   3.29   3.06   2.90   2.79   2.64   2.48   2.38   2.33   2.25   2.20   2.18
  16     4.49   3.63   3.24   3.01   2.85   2.74   2.59   2.42   2.33   2.28   2.19   2.15   2.12
  17     4.45   3.59   3.20   2.96   2.81   2.70   2.55   2.38   2.29   2.23   2.15   2.10   2.08
  18     4.41   3.55   3.16   2.93   2.77   2.66   2.51   2.34   2.25   2.19   2.11   2.06   2.04
  19     4.38   3.52   3.13   2.90   2.74   2.63   2.48   2.31   2.21   2.16   2.07   2.03   2.00
  20     4.35   3.49   3.10   2.87   2.71   2.60   2.45   2.28   2.18   2.12   2.04   1.99   1.97
  22     4.30   3.44   3.05   2.82   2.66   2.55   2.40   2.23   2.13   2.07   1.98   1.94   1.91
  24     4.26   3.40   3.01   2.78   2.62   2.51   2.36   2.18   2.09   2.03   1.94   1.89   1.86
  26     4.23   3.37   2.98   2.74   2.59   2.47   2.32   2.15   2.05   1.99   1.90   1.85   1.82
  28     4.20   3.34   2.95   2.71   2.56   2.45   2.29   2.12   2.02   1.96   1.87   1.82   1.79
  30     4.17   3.32   2.92   2.69   2.53   2.42   2.27   2.09   1.99   1.93   1.84   1.79   1.76
  40     4.08   3.23   2.84   2.61   2.45   2.34   2.18   2.00   1.90   1.84   1.74   1.69   1.66
  50     4.03   3.18   2.79   2.56   2.40   2.29   2.13   1.95   1.85   1.78   1.69   1.63   1.60
  60     4.00   3.15   2.76   2.53   2.37   2.25   2.10   1.92   1.82   1.75   1.65   1.59   1.56
  70     3.98   3.13   2.74   2.50   2.35   2.23   2.07   1.89   1.79   1.72   1.62   1.57   1.53
  80     3.96   3.11   2.72   2.49   2.33   2.21   2.06   1.88   1.77   1.70   1.60   1.54   1.51
  100    3.94   3.09   2.70   2.46   2.31   2.19   2.03   1.85   1.75   1.68   1.57   1.52   1.48

Table 3.2


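of values of a given number $n$ of IID real RVs $X_1,\ldots,X_n$, with a common
PMF/PDF $f(x;\theta)$. The notation $f(x;\theta)$ aims to stress that the PMF/PDF
under consideration depends on a parameter $\theta$ varying within a given range
$\Theta$. The joint PDF/PMF of the random vector $\mathbf{X}$ is denoted by
$f_{\mathbf{X}}(\mathbf{x};\theta)$ or $f(\mathbf{x};\theta)$ and is given by the product

\[
f_{\mathbf{X}}(\mathbf{x};\theta) = f(x_1;\theta)\cdots f(x_n;\theta),\quad
\mathbf{X} = (X_1,\ldots,X_n)^{\mathrm T},\quad \mathbf{x} = (x_1,\ldots,x_n)^{\mathrm T}.
\tag{3.2.1}
\]

Here, and below, the vector $\mathbf{x}$ is a sample value of $\mathbf{X}$. (It follows
the tradition where capital letters refer to RVs and small letters to their sample
values.) The probability distribution generated by $f_{\mathbf{X}}(\,\cdot\,;\theta)$ is
denoted by $\mathbb{P}_\theta$ and the expectation and variance relative to
$\mathbb{P}_\theta$ by $\mathbb{E}_\theta$ and $\mathrm{Var}_\theta$.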
In a continuous model where we are dealing with a PDF, the argument $x$ is
allowed to vary in $\mathbb{R}$; more precisely, within a range where
$f(x;\theta) > 0$ for at least one $\theta \in \Theta$. Similarly,
$\mathbf{x} \in \mathbb{R}^n$ is a vector from the set where
$f_{\mathbf{X}}(\mathbf{x};\theta) > 0$ for at least one value of $\theta \in \Theta$.
In a discrete model where $f(x;\theta)$ is a PMF, $x$ varies within a specified
discrete set $\mathbb{V} \subset \mathbb{R}$ (say, $\mathbb{Z}_+ = \{0,1,2,\ldots\}$,
the set of non-negative integers in the case of a Poisson distribution). Then
$\mathbf{x} \in \mathbb{V}^n$ is a vector with components from $\mathbb{V}$.

The subscript $\mathbf{X}$ in the notation $f_{\mathbf{X}}(\mathbf{x};\theta)$ will
often be omitted in the rest of the book.

The precise value of parameter $\theta$ is unknown; our aim is to ‘estimate’
it from sample $\mathbf{x} = (x_1,\ldots,x_n)$. This means that we want to determine
a function $\theta^*(\mathbf{x})$ depending on sample $\mathbf{x}$ but not on
$\theta$ which we could take as a projected value of $\theta$. Such a function will
be called an estimator of $\theta$; its particular value is often called an
estimate. (Some authors use the term ‘an estimate’ instead of ‘an estimator’;
others use ‘a point estimator’ or even ‘a point estimate’.) For example, in the
simple case in which $\theta$ admits just two values, $\theta_0$ and $\theta_1$, an
estimator would assign a value $\theta_0$ or $\theta_1$ to each observed sample
$\mathbf{x}$. This would create a partition of the sample space (the set of
outcomes) into two domains, one where the estimator takes value $\theta_0$ and
another where it is equal to $\theta_1$. In general, as was said, we suppose that
$\theta \in \Theta$, a given set of values. (For instance, values of $\theta$ and
$\theta^*$ may be vectors.)

For example, it is well known that the number of hops by a bird before 
it takes off is described by a geometric distribution. Similarly, emission of 
alpha-particles by radioactive material is described by a Poisson distribution 
(this follows immediately if one assumes that the emission mechanism works 
independently as time progresses). However, the parameter of the distribu- 
tion may vary with the type of bird or the material used in the emission 
experiment (and also other factors). It is important to assess the unknown 
value of the parameter ($q$ or $\lambda$) from an observed sample $x_1,\ldots,x_n$,
where $x_i$ is the number of emitted particles within the $i$th period of observation.
In the 1930s when the experimental techniques were very basic, one simply 
counted the emitted particles visually. At Cambridge University, people still 
remember that E. Rutherford (1871-1937), the famous physicist and Direc- 
tor of the Cavendish Laboratory, when recruiting a new member of staff, 
asked two straightforward questions: ‘Have you got a First?’ and ‘Can you 
count?’. Answering ‘Yes’ to both questions was a necessary condition for 
being hired. 

In principle, any function of x can be considered as an estimator, but in 
practice we want it to be ‘reasonable’. We therefore need to develop criteria 
for which estimator is good and which bad. The domain of statistics that
emerges is called parametric, or parameter, estimation. 


Example 3.2.1 Let $X_1,\ldots,X_n$ be IID and $X_i \sim \mathrm{Po}(\lambda)$. An
estimator of $\lambda = \mathbb{E}X_i$ is the sample mean $\overline{X}$, where

\[
\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i. \tag{3.2.2}
\]

Observe that $n\overline{X} \sim \mathrm{Po}(n\lambda)$. We immediately see that the
sample mean has the following useful properties:

(i) The random value $\overline{X} = \sum_{i=1}^n X_i/n$ is grouped around the true
value of the parameter:

\[
\mathbb{E}\overline{X} = \frac{1}{n}\sum_{i=1}^n \mathbb{E}X_i = \mathbb{E}X_1 = \lambda.
\tag{3.2.3}
\]

This property is called unbiasedness and will be discussed below in detail.

(ii) $\overline{X}$ approaches the true value as $n \to \infty$:

\[
\mathbb{P}\Big(\lim_{n\to\infty}\overline{X} = \lambda\Big) = 1
\quad\text{(the strong LLN).} \tag{3.2.4}
\]

Property (ii) is called consistency.


An unbiased and consistent statistician?
This is the complement to an event of probability 1.
(From the series ‘Why they are misunderstood’.)


(iii) For large $n$,

\[
\sqrt{\frac{n}{\lambda}}\,\big(\overline{X} - \lambda\big) \sim \mathrm{N}(0,1)
\quad\text{(the CLT).} \tag{3.2.5}
\]

This property is often called asymptotic normality.


Even when statisticians are normal,
in most cases they are only asymptotically normal.
(From the series ‘Why they are misunderstood’.)


We are also able to see that $\overline{X}$ has another important property.

(iv) $\overline{X}$ has the minimal mean square error in a wide class of
estimators $\lambda^*$:

\[
\mathbb{E}\big(\overline{X} - \lambda\big)^2 \le \mathbb{E}\big(\lambda^* - \lambda\big)^2.
\tag{3.2.6}
\]
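
Properties (i) and (ii) can be probed numerically. The following is a minimal simulation sketch, not part of the original text: the values of `lam`, `n`, the number of replications and the seed are arbitrary assumptions, and the Poisson sampler is Knuth's multiplication method, used here only because the Python standard library has no Poisson generator.

```python
import math
import random
import statistics


def poisson_draw(lam, rng):
    """Draw one Po(lam) value by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_means(lam, n, replications, seed=0):
    """Return independent copies of the sample mean of n IID Po(lam) RVs."""
    rng = random.Random(seed)
    return [
        sum(poisson_draw(lam, rng) for _ in range(n)) / n
        for _ in range(replications)
    ]


lam, n = 3.0, 50                      # illustrative values only
means = sample_means(lam, n, replications=4000)

# Unbiasedness: the average of the sample means should be close to lam.
print("average of sample means:", statistics.fmean(means))
# Since n * mean ~ Po(n * lam), the variance of the sample mean is lam / n.
print("variance of sample means:", statistics.pvariance(means), "vs", lam / n)
```

Increasing `n` shrinks the spread of the simulated sample means around `lam`, which is the behaviour that consistency describes.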
Example 3.2.2 Let $X_1,\ldots,X_n$ be IID and $X_i \sim \mathrm{Bin}(k,p)$. Recall,
$\mathbb{E}X_i = kp$, $\mathrm{Var}\,X_i = kp(1-p)$. Suppose that $k$ is known but
value $p = \mathbb{E}X_i/k \in (0,1)$ is unknown. An estimator of $p$ is
$\overline{X}/k$, with $n\overline{X} \sim \mathrm{Bin}(kn,p)$. Here, as before:

(i) $\mathbb{E}\,\overline{X}/k = p$ (unbiasedness).

(ii) $\mathbb{P}\big(\lim_{n\to\infty}\overline{X}/k = p\big) = 1$ (consistency).

(iii) $\sqrt{kn/[p(1-p)]}\,\big(\overline{X}/k - p\big) \sim \mathrm{N}(0,1)$ for $n$
large (asymptotic normality).

(iv) $\overline{X}/k$ has the minimal mean square error in a wide class of estimators.

Now what if we know $p$ but value $k = 1,2,\ldots$ is unknown? In a similar
fashion, $\overline{X}/p$ can be considered as an estimator of $k$ (never mind that
it takes non-integer values!). Again, one can check that properties (i)–(iii) hold.


Example 3.2.3 A frequent example is where $X_1,\ldots,X_n$ are IID and
$X_i \sim \mathrm{N}(\mu,\sigma^2)$. When speaking of normal samples, one usually
distinguishes three situations:

(I) the mean $\mu \in \mathbb{R}$ is unknown and variance $\sigma^2$ known (say, $\sigma^2 = 1$);
(II) $\mu$ is known (say, equal to 0) and $\sigma^2 > 0$ unknown; and
(III) neither $\mu$ nor $\sigma$ is known.

In cases (I) and (III), an estimator for $\mu$ is the sample mean

\[
\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i, \quad\text{with}\quad
\mathbb{E}\overline{X} = \frac{1}{n}\sum_{i=1}^n \mathbb{E}X_i = \mathbb{E}X_1 = \mu.
\tag{3.2.7}
\]

From Example 3.1.1 we know that $\overline{X} \sim \mathrm{N}(\mu,\sigma^2/n)$; see
equation (3.1.3).

In case (II), an estimator for $\sigma^2$ is $\overline{\Sigma}^2/n$, where

\[
\overline{\Sigma}^2 = \sum_{i=1}^n (X_i - \mu)^2, \quad\text{with}\quad
\mathbb{E}\,\overline{\Sigma}^2 = \sum_{i=1}^n \mathbb{E}(X_i - \mu)^2
= n\,\mathrm{Var}\,X_1 = n\sigma^2, \tag{3.2.8}
\]

and $\mathbb{E}\big(\overline{\Sigma}^2/n\big) = \sigma^2$. From Example 3.1.2 we
deduce that $\overline{\Sigma}^2/\sigma^2 \sim \chi^2_n$.

With regard to an estimator of $\sigma^2$ in case (III), it was established that
setting

\[
S_{XX} = \sum_{i=1}^n \big(X_i - \overline{X}\big)^2 \tag{3.2.9}
\]

yields

\[
\mathbb{E}\Big(\frac{1}{n-1}S_{XX}\Big) = \frac{1}{n-1}\,\mathbb{E}S_{XX} = \sigma^2.
\tag{3.2.10}
\]

See Worked Example 1.4.8 where this fact was verified for IID RVs with an
arbitrary distribution. Hence, an estimator of $\sigma^2$ is provided by $S_{XX}/(n-1)$.
We are now able to specify the distribution of $S_{XX}/\sigma^2$ as $\chi^2_{n-1}$;
this is a part of the Fisher Theorem (see Section 3.5).


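So, in case (III), the pair

\[
\Big(\overline{X},\ \frac{S_{XX}}{n-1}\Big)
\]

can be taken as an estimator for vector $(\mu,\sigma^2)$, and we obtain an analogue
of property (i) (joint unbiasedness):

\[
\Big(\mathbb{E}\overline{X},\ \mathbb{E}\frac{S_{XX}}{n-1}\Big) = (\mu,\sigma^2).
\]

Also, as $n \to \infty$, both $\overline{X}$ and $S_{XX}/(n-1)$ approach the
estimated values $\mu$ and $\sigma^2$:

\[
\mathbb{P}\Big(\lim_{n\to\infty}\overline{X} = \mu,\ \lim_{n\to\infty}\frac{S_{XX}}{n-1}
= \sigma^2\Big) = 1 \quad\text{(again the strong LLN).}
\]

This gives an analogue of property (ii) (joint consistency). For $\overline{X}$ this
property can be deduced in a straightforward way from the fact that
$\overline{X} \sim \mathrm{N}(\mu,\sigma^2/n)$ and for $S_{XX}/(n-1)$ from the fact
that $S_{XX}/\sigma^2 \sim \chi^2_{n-1}$. The latter remarks also help to check the
analogue of property (iii) (joint asymptotic normality): as $n \to \infty$

\[
\frac{\sqrt{n}}{\sigma}\big(\overline{X} - \mu\big) \sim \mathrm{N}(0,1), \qquad
\frac{\sqrt{n-1}}{\sigma^2}\Big(\frac{S_{XX}}{n-1} - \sigma^2\Big) \sim \mathrm{N}(0,2),
\quad\text{independently.}
\]

In other words, the pair

\[
\bigg(\frac{\sqrt{n}}{\sigma}\big(\overline{X} - \mu\big),\
\frac{\sqrt{n-1}}{\sigma^2}\Big(\frac{S_{XX}}{n-1} - \sigma^2\Big)\bigg)
\]

is asymptotically bivariate normal, with

\[
\text{mean vector } \begin{pmatrix} 0 \\ 0 \end{pmatrix}
\text{ and the covariance matrix } \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}.
\]

When checking this fact, you should verify that the variance of $S_{XX}$ equals
$2(n-1)\sigma^4$.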
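
One way to carry out this check is sketched below. It assumes the fact quoted above that $S_{XX}/\sigma^2 \sim \chi^2_{n-1}$, together with the standard formula $\mathrm{Var}\,\chi^2_m = 2m$ (which is not derived in this section).

```latex
\[
\mathrm{Var}\,S_{XX} = \sigma^4\,\mathrm{Var}\Big(\frac{S_{XX}}{\sigma^2}\Big)
= 2(n-1)\sigma^4,
\qquad
\mathrm{Var}\bigg[\frac{\sqrt{n-1}}{\sigma^2}\Big(\frac{S_{XX}}{n-1}-\sigma^2\Big)\bigg]
= \frac{n-1}{\sigma^4}\cdot\frac{2(n-1)\sigma^4}{(n-1)^2} = 2 .
\]
```

So the second component of the pair has variance exactly 2 for every $n$, in agreement with the limiting covariance matrix.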

An analogue of property (iv) also holds in this example, although we 
should be careful about how we define minimal mean square error for a 


vector estimator. 


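It is clear that the fact that $X_i$ has a specific distribution plays an
insignificant role here: properties (i)–(iv) are expected to hold in a wide range
of situations. In fact, each of them develops into a recognised direction of
statistical theory. Here and below, we first of all focus on property (i) and
call an estimator $\theta^*$ ($= \theta^*(\mathbf{x})$) of a parameter $\theta$
unbiased if

\[
\mathbb{E}_\theta\,\theta^*(\mathbf{X}) = \theta, \quad\text{for all } \theta \in \Theta.
\tag{3.2.11}
\]

We will also discuss properties of mean square errors.

So, concluding this section, we summarise that for a vector $\mathbf{X}$ of IID real
RVs $X_1,\ldots,X_n$, (I) the sample mean

\[
\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i \tag{3.2.12}
\]

is always an unbiased estimator of the mean $\mathbb{E}X_1$:

\[
\mathbb{E}\overline{X} = \frac{1}{n}\sum_{i=1}^n \mathbb{E}X_i = \mathbb{E}X_1,
\tag{3.2.13}
\]

(II) in the case of a known mean $\mathbb{E}X_1$,

\[
\frac{1}{n}\overline{\Sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}X_i)^2
\tag{3.2.14}
\]

is an unbiased estimator of the variance $\mathrm{Var}\,X_1$:

\[
\mathbb{E}\Big(\frac{1}{n}\overline{\Sigma}^2\Big)
= \frac{1}{n}\sum_{i=1}^n \mathbb{E}(X_i - \mathbb{E}X_1)^2
= \mathbb{E}(X_1 - \mathbb{E}X_1)^2 = \mathrm{Var}\,X_1, \tag{3.2.15}
\]

and (III) in the case of an unknown mean,

\[
\frac{1}{n-1}S_{XX} = \frac{1}{n-1}\sum_{i=1}^n \big(X_i - \overline{X}\big)^2
\tag{3.2.16}
\]

is an unbiased estimator of the variance $\mathrm{Var}\,X_1$:

\[
\mathbb{E}\Big(\frac{1}{n-1}S_{XX}\Big) = \mathrm{Var}\,X_1, \tag{3.2.17}
\]

as was shown in Worked Example 1.4.8.

Estimators $\overline{\Sigma}^2/n$ and $S_{XX}/(n-1)$ are sometimes called the
sample variances.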
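
The three unbiasedness statements can be verified exactly on a small discrete model by enumerating every possible sample. The sketch below is an illustration under assumed choices: the toy distribution (uniform on {0, 1, 2}) and the sample size n = 3 are hypothetical and not taken from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy model: X_i uniform on {0, 1, 2}, sample size n = 3.
values = [0, 1, 2]
n = 3
prob = Fraction(1, len(values)) ** n                  # probability of each sample

true_mean = Fraction(sum(values), len(values))        # E X_1
true_var = sum((v - true_mean) ** 2 for v in values) / len(values)   # Var X_1


def s_xx(x):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(x), len(x))
    return sum((xi - xbar) ** 2 for xi in x)


def sigma_bar_sq(x):
    """Sum of squared deviations from the known mean."""
    return sum((xi - true_mean) ** 2 for xi in x)


samples = list(product(values, repeat=n))
expected_xbar = sum(prob * Fraction(sum(x), n) for x in samples)
expected_sigma = sum(prob * sigma_bar_sq(x) for x in samples)
expected_s_xx = sum(prob * s_xx(x) for x in samples)

print(expected_xbar, true_mean)             # (I): sample mean is unbiased
print(expected_sigma / n, true_var)         # (II): known-mean estimator is unbiased
print(expected_s_xx / (n - 1), true_var)    # (III): S_XX/(n-1) is unbiased
print(expected_s_xx / n)                    # dividing by n instead gives (n-1)/n * Var X_1
```

Exact rational arithmetic is used so that the equalities hold identically rather than up to rounding.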


Statisticians stubbornly insist that the n justifies the means. 
(From the series ‘Why they are misunderstood’. ) 
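3.3 Sufficient statistics. The factorisation criterion


There are two kinds of statistics, the kind you look up,
and the kind you make up.
R.T. Stout (1886-1975), American detective-story writer.


In general, a statistic (or a sample statistic) is an arbitrary function of sample
vector $\mathbf{x}$ or its random counterpart $\mathbf{X}$. In the parametric setting
that we have adopted, we call a function $T$ of $\mathbf{x}$ (possibly, with vector
values) a sufficient statistic for parameter $\theta \in \Theta$ if the conditional
distribution of random sample $\mathbf{X}$ given $T(\mathbf{X})$ does not depend on
$\theta$. That is, for all $D \subset \mathbb{R}^n$

\[
\mathbb{P}_\theta\big(\mathbf{X} \in D \,\big|\, T(\mathbf{X})\big)
= \mathbb{E}_\theta\big(I(\mathbf{X} \in D) \,\big|\, T(\mathbf{X})\big)
\ \text{ is the same for all } \theta \in \Theta. \tag{3.3.1}
\]

The significance of this concept is that the sufficient statistic encapsulates
all knowledge about sample $\mathbf{x}$ needed to produce a ‘good’ estimator for $\theta$.

In Example 3.2.1, the sample mean $\overline{X}$ is a sufficient statistic for
$\lambda$. In fact, for all non-negative integer-valued vector
$\mathbf{x} = (x_1,\ldots,x_n) \in \mathbb{Z}_+^n$ with $\sum_i x_i = nt$, the
conditional probability $\mathbb{P}_\lambda(\mathbf{X} = \mathbf{x} \,|\, \overline{X} = t)$
equals

\[
\frac{\mathbb{P}_\lambda(\mathbf{X} = \mathbf{x},\ \overline{X} = t)}{\mathbb{P}_\lambda(\overline{X} = t)}
= \frac{\mathbb{P}_\lambda(\mathbf{X} = \mathbf{x})}{\mathbb{P}_\lambda(\overline{X} = t)}
= \frac{e^{-n\lambda}\prod_i \lambda^{x_i}/x_i!}{e^{-n\lambda}(n\lambda)^{nt}/(nt)!}
= \frac{(nt)!}{n^{nt}\prod_i x_i!},
\]

which does not depend on $\lambda > 0$. We used here the fact that the events
$\{\mathbf{X} = \mathbf{x},\ \overline{X} = t\}$ and $\{\mathbf{X} = \mathbf{x}\}$
coincide (as the equation $\overline{X} = t$ holds trivially) and the fact that
$n\overline{X} \sim \mathrm{Po}(n\lambda)$.

So, in general we can write

\[
\mathbb{P}_\lambda(\mathbf{X} = \mathbf{x} \,|\, \overline{X} = t)
= \frac{(nt)!}{n^{nt}\prod_i x_i!}\, I\Big(\sum_{i=1}^n x_i = nt\Big).
\]

Of course, $n\overline{x} = \sum_i x_i$ is another sufficient statistic, and
$\overline{x}$ and $n\overline{x}$ (or their random counterparts $\overline{X}$ and
$n\overline{X}$) are, in a sense, equivalent (as one-to-one images of each other).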
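
The absence of $\lambda$ in this conditional probability can be checked directly. The following sketch (the sample `x` and the list of `lam` values are arbitrary assumptions) computes the ratio of the joint Poisson PMF to the PMF of the sum, and compares it with the closed form $(nt)!/(n^{nt}\prod_i x_i!)$.

```python
import math


def poisson_pmf(k, lam):
    """PMF of Po(lam) at the non-negative integer k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def conditional_prob(x, lam):
    """P_lam(X = x | sum of X_i = sum of x_i) for IID Po(lam) entries."""
    n, total = len(x), sum(x)
    joint = math.prod(poisson_pmf(xi, lam) for xi in x)
    return joint / poisson_pmf(total, n * lam)     # the sum is Po(n * lam)


def closed_form(x):
    """(nt)! / (n^{nt} * prod x_i!), where nt = sum of x_i."""
    n, total = len(x), sum(x)
    denominator = n ** total * math.prod(math.factorial(xi) for xi in x)
    return math.factorial(total) / denominator


x = (2, 0, 3, 1)                      # hypothetical sample
for lam in (0.5, 2.0, 7.0):
    # The two printed numbers agree, whatever the value of lam.
    print(lam, conditional_prob(x, lam), closed_form(x))
```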

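Similarly, in Example 3.2.2, $\overline{x}$ is sufficient for $p$ with a known $k$.
Here, for all $\mathbf{x} \in \mathbb{Z}^n$ with entries $x_i = 0,1,\ldots,k$ and the
sum $\sum_i x_i = nt$, the conditional probability
$\mathbb{P}_p(\mathbf{X} = \mathbf{x} \,|\, \overline{X} = t)$ equals

\[
\frac{\mathbb{P}_p(\mathbf{X} = \mathbf{x})}{\mathbb{P}_p(\overline{X} = t)}
= \frac{\prod_i \binom{k}{x_i} p^{x_i}(1-p)^{k-x_i}}{\binom{nk}{nt} p^{nt}(1-p)^{nk-nt}}
= \frac{(k!)^n}{(nk)!}\,\frac{(nt)!\,(nk-nt)!}{\prod_i x_i!\,(k-x_i)!},
\]

which does not depend on $p \in (0,1)$. As before, if $\sum_i x_i \ne nt$, then
$\mathbb{P}_p(\mathbf{X} = \mathbf{x} \,|\, \overline{X} = t) = 0$ which again does not
depend on $p$.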

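Consider now Example 3.2.3 where RVs $X_1,\ldots,X_n$ are IID and
$X_i \sim \mathrm{N}(\mu,\sigma^2)$. Here,

(I) with $\sigma^2$ known, the sufficient statistic for $\mu$ is

\[
\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i,
\]

(II) with $\mu$ known, a sufficient statistic for $\sigma^2$ is

\[
\overline{\Sigma}^2 = \sum_{i=1}^n (x_i - \mu)^2,
\]

(III) with both $\mu$ and $\sigma^2$ unknown, a sufficient statistic for $(\mu,\sigma^2)$ is

\[
\Big(\overline{x},\ \sum_{i=1}^n x_i^2\Big).
\]

The most efficient way to check these facts is to use the factorisation
criterion.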

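The factorisation criterion is a general statement about sufficient statistics.
It states:

$T$ is sufficient for $\theta$ if and only if the PMF/PDF
$f_{\mathbf{X}}(\mathbf{x};\theta)$ can be written as a product
$g(T(\mathbf{x}),\theta)h(\mathbf{x})$ for some functions $g$ and $h$.

The proof in the discrete case, with PMF
$f_{\mathbf{X}}(\mathbf{x};\theta) = \mathbb{P}_\theta(\mathbf{X} = \mathbf{x})$, is
straightforward. In fact, for the ‘if’ part, assume that the above factorisation
holds. Then for sample vector $\mathbf{x} \in \mathbb{V}^n$ with $T(\mathbf{x}) = t$,
the conditional probability $\mathbb{P}_\theta(\mathbf{X} = \mathbf{x} \,|\, T = t)$ equals

\[
\frac{\mathbb{P}_\theta(\mathbf{X} = \mathbf{x},\ T = t)}{\mathbb{P}_\theta(T = t)}
= \frac{\mathbb{P}_\theta(\mathbf{X} = \mathbf{x})}{\mathbb{P}_\theta(T = t)}
= \frac{g(T(\mathbf{x}),\theta)h(\mathbf{x})}
{\sum_{\tilde{\mathbf{x}} \in \mathbb{V}^n:\ T(\tilde{\mathbf{x}}) = t} g(T(\tilde{\mathbf{x}}),\theta)h(\tilde{\mathbf{x}})}
= \frac{g(t,\theta)h(\mathbf{x})}
{g(t,\theta)\sum_{\tilde{\mathbf{x}} \in \mathbb{V}^n:\ T(\tilde{\mathbf{x}}) = t} h(\tilde{\mathbf{x}})}
= \frac{h(\mathbf{x})}{\sum_{\tilde{\mathbf{x}} \in \mathbb{V}^n:\ T(\tilde{\mathbf{x}}) = t} h(\tilde{\mathbf{x}})}.
\]

This does not depend on $\theta$.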
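If, on the other hand, the compatibility condition $T(\mathbf{x}) = t$ fails (i.e.
$T(\mathbf{x}) \ne t$) then $\mathbb{P}_\theta(\mathbf{X} = \mathbf{x} \,|\, T = t) = 0$
which again does not depend on $\theta$. A general formula is

\[
\mathbb{P}_\theta(\mathbf{X} = \mathbf{x} \,|\, T = t)
= \frac{h(\mathbf{x})}{\sum_{\tilde{\mathbf{x}} \in \mathbb{V}^n:\ T(\tilde{\mathbf{x}}) = t} h(\tilde{\mathbf{x}})}\,
I\big(T(\mathbf{x}) = t\big).
\]

As the RHS does not depend on $\theta$, $T$ is sufficient.

For the ‘only if’ part of the criterion we assume that
$\mathbb{P}_\theta(\mathbf{X} = \mathbf{x} \,|\, T = t)$ does not depend on $\theta$.
Then, again for $\mathbf{x} \in \mathbb{V}^n$ with $T(\mathbf{x}) = t$

\[
f_{\mathbf{X}}(\mathbf{x};\theta) = \mathbb{P}_\theta(\mathbf{X} = \mathbf{x})
= \mathbb{P}_\theta(\mathbf{X} = \mathbf{x} \,|\, T = t)\,\mathbb{P}_\theta(T = t).
\]

The factor $\mathbb{P}_\theta(\mathbf{X} = \mathbf{x} \,|\, T = t)$ does not depend on
$\theta$; we denote it by $h(\mathbf{x})$. The factor $\mathbb{P}_\theta(T = t)$ is then
denoted by $g(t,\theta)$, and we obtain the factorisation.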

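In the continuous case the proof goes along similar lines (although to make
it formally impeccable one needs some elements of measure theory). Namely,
we write the conditional PDF $f_{\mathbf{X}|T}(\mathbf{x}|t)$ as the ratio

\[
f_{\mathbf{X}|T}(\mathbf{x}|t) = \frac{f_{\mathbf{X}}(\mathbf{x};\theta)}{f_T(t;\theta)}\,
I\big(T(\mathbf{x}) = t\big)
\]

and represent the PDF $f_T(t;\theta)$ as the integral

\[
f_T(t;\theta) = \int_{\{\tilde{\mathbf{x}} \in \mathbb{R}^n:\ T(\tilde{\mathbf{x}}) = t\}}
f_{\mathbf{X}}(\tilde{\mathbf{x}};\theta)\,\mathrm{d}(\tilde{\mathbf{x}}|t)
\]

over the level surface $\{\tilde{\mathbf{x}} \in \mathbb{R}^n : T(\tilde{\mathbf{x}}) = t\}$,
against the area element $\mathrm{d}(\tilde{\mathbf{x}}|t)$ on this surface. Then for
the ‘if’ part we again use the representation
$f_{\mathbf{X}}(\mathbf{x};\theta) = g(T(\mathbf{x}),\theta)h(\mathbf{x})$ and arrive at
the equation

\[
f_{\mathbf{X}|T}(\mathbf{x}|t)
= \frac{h(\mathbf{x})}{\int_{\{\tilde{\mathbf{x}} \in \mathbb{R}^n:\ T(\tilde{\mathbf{x}}) = t\}}
\mathrm{d}(\tilde{\mathbf{x}}|t)\,h(\tilde{\mathbf{x}})}\,
I\big(T(\mathbf{x}) = t\big),
\]

with the RHS independent of $\theta$. For the ‘only if’ part we simply rewrite
$f_{\mathbf{X}}(\mathbf{x};\theta)$ as $f_{\mathbf{X}|T}(\mathbf{x}|t)f_T(t;\theta)$,
where $t = T(\mathbf{x})$, and set, as before,
$h(\mathbf{x}) = f_{\mathbf{X}|T}(\mathbf{x}|t)$ and $g(T(\mathbf{x}),\theta) = f_T(t;\theta)$.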

The factorisation criterion means that $T$ is a sufficient statistic when
$T(\mathbf{x}) = T(\mathbf{x}')$ implies that the ratio
$f_{\mathbf{X}}(\mathbf{x};\theta)/f_{\mathbf{X}}(\mathbf{x}';\theta)$ is the same for
all $\theta \in \Theta$. The next step is to consider a minimal sufficient statistic
for which $T(\mathbf{x}) = T(\mathbf{x}')$ is equivalent to the fact that the ratio
$f_{\mathbf{X}}(\mathbf{x};\theta)/f_{\mathbf{X}}(\mathbf{x}';\theta)$ is the same for
all $\theta \in \Theta$. This concept is convenient because the minimal sufficient
statistic is a function of any other sufficient statistic. In other words, the
minimal sufficient statistic has the ‘largest’ level sets where it takes a constant
value, which represents the least amount of detail we should know about sample
$\mathbf{x}$. Any further suppression of information about the sample would result
in the loss of sufficiency.

In all examples below, sufficient statistics are minimal.

The statement and the proof of the factorisation criterion (in the discrete
case) have been extremely popular among the Cambridge University
examination questions.

The idea behind the factorisation criterion goes back to a 1925 paper 
by R.A. Fisher (1890-1962), the outstanding UK applied mathematician, 
statistician and geneticist, whose name will be often quoted in this part of 
the book. (Some authors trace the factorisation criterion to his 1912 paper.) 
The concept was further developed in Fisher’s 1934 work. An important 
role was also played by a 1935 paper by J. Neyman (1894-1981), a Polish-American
statistician (born in Moldova, educated in the Ukraine, worked in
Poland and the UK and later in the USA). Neyman’s name will also appear
in this part of the book on many occasions, mainly in connection with the
Neyman–Pearson Lemma. See below.


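Example 3.3.1 (i) Let $X_i \sim \mathrm{U}(0,\theta)$ where $\theta > 0$ is unknown.
Then $T(\mathbf{x}) = \max x_i$, $\mathbf{x} = (x_1,\ldots,x_n) \in \mathbb{R}^n_+$, is a
sufficient statistic for $\theta$.

(ii) Now consider $X_i \sim \mathrm{U}(\theta,\theta+1)$, $\theta \in \mathbb{R}$. Here
the sample PDF

\[
f_{\mathbf{X}}(\mathbf{x};\theta) = \prod_{i=1}^n I(\theta \le x_i \le \theta+1)
= I(\min x_i \ge \theta)\,I(\max x_i \le \theta+1), \quad \mathbf{x} \in \mathbb{R}^n.
\]

We see that the factorisation holds when we take the statistic $T(\mathbf{x})$ to be a
two-vector $(\min x_i, \max x_i)$, function

\[
g\big((y_1,y_2),\theta\big) = I(y_1 \ge \theta)\,I(y_2 \le \theta+1)
\]

and $h(\mathbf{x}) \equiv 1$. Hence, the pair $(\min x_i, \max x_i)$ is sufficient for $\theta$.

(iii) Let $X_1,\ldots,X_n$ form a random sample from a Poisson distribution
for which the value of mean $\theta$ is unknown. Find a one-dimensional sufficient
statistic for $\theta$.

(Answer: $T(\mathbf{X}) = \sum_i X_i$.)

(iv) Assume that $X_i \sim \mathrm{N}(\mu,\sigma^2)$, where both $\mu \in \mathbb{R}$
and $\sigma^2 > 0$ are unknown. Then $T(\mathbf{X}) = \big(\sum_i X_i, \sum_i X_i^2\big)$
is a sufficient statistic for $\theta = (\mu,\sigma^2)$.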
3.4 Maximum likelihood estimators


Robin Likelihood – Prince of Liars.
(From the series ‘Movies that never made it to the Big Screen’.)


The concept of a maximum likelihood estimator (MLE) forms the basis of
a powerful (and beautiful) method of constructing good estimators which
is now universally adopted (and called the method of maximum likelihood).
Here, we treat the PMF/PDF $f_{\mathbf{X}}(\mathbf{x};\theta)$ as a function of
$\theta \in \Theta$ depending on the observed sample $\mathbf{x}$ as a parameter. We
then take the value of $\theta$ that maximises this function on set $\Theta$:

\[
\widehat{\theta}\ \big(= \widehat{\theta}(\mathbf{x})\big)
= \arg\max\big[f(\mathbf{x};\theta):\ \theta \in \Theta\big]. \tag{3.4.1}
\]

In this context, $f(\mathbf{x};\theta)$ is often called the likelihood function (the
likelihood for short) for the sample $\mathbf{x}$. Instead of maximising
$f(\mathbf{x};\theta)$, one often prefers to maximise its logarithm
$\ell(\mathbf{x};\theta) = \ln f(\mathbf{x};\theta)$, which is called the
log-likelihood function, or log-likelihood (LL) for short. The MLE is then
defined as

\[
\widehat{\theta} = \arg\max\big[\ell(\mathbf{x};\theta):\ \theta \in \Theta\big].
\]

The idea of an MLE was conceived in 1921 by Fisher.

Often, the maximiser is unique (although it may lie on the boundary of
allowed set $\Theta$). If $\ell(\mathbf{x};\theta)$ is a smooth function of
$\theta \in \Theta$, one could consider stationary points, where the first
derivative vanishes

\[
\frac{\mathrm{d}}{\mathrm{d}\theta}\ell(\mathbf{x};\theta) = 0, \quad
\Big[\text{or } \frac{\partial}{\partial\theta_j}\ell(\mathbf{x};\theta) = 0,\ j = 1,\ldots,d,
\text{ if } \theta = (\theta_1,\ldots,\theta_d)\Big]. \tag{3.4.2}
\]

Of course, one has to select from the roots of equation (3.4.2) the local
maximisers (by checking the signs of the second derivatives or otherwise) and
establish which of these maximisers is global. Luckily, in the examples that
follow, the stationary point (when it exists) is always unique. When parameter
set $\Theta$ is unbounded (say a real line) then to check that a (unique)
stationary point gives a global maximiser, it is enough to verify that
$\ell(\mathbf{x};\theta) \to -\infty$ for large values of $|\theta|$.

Finding the MLE (sometimes together with a sufficient statistic) is another
hugely popular examination topic.


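Example 3.4.1 In this example we identify MLEs for some concrete models.

(i) The MLE for $\lambda$ when $X_i \sim \mathrm{Po}(\lambda)$. Here, for all
$\mathbf{x} = (x_1,\ldots,x_n) \in \mathbb{Z}_+^n$ with non-negative integer entries
$x_i \in \mathbb{Z}_+$, the LL

\[
\ell(\mathbf{x};\lambda) = -n\lambda + \sum_i x_i \ln\lambda - \sum_i \ln(x_i!),
\quad \lambda > 0.
\]

Differentiating with respect to $\lambda$ yields

\[
\frac{\partial}{\partial\lambda}\ell(\mathbf{x};\lambda)
= -n + \frac{1}{\lambda}\sum_i x_i = 0 \quad\text{and}\quad
\widehat{\lambda} = \frac{1}{n}\sum_i x_i = \overline{x}.
\]

Furthermore,

\[
\frac{\partial^2}{\partial\lambda^2}\ell(\mathbf{x};\lambda)
= -\frac{1}{\lambda^2}\sum_i x_i < 0.
\]

So, $\overline{x}$ gives the (global) maximum. We see that the MLE $\widehat{\lambda}$
of $\lambda$ coincides with the sample mean. In particular, it is unbiased.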
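
As a sanity check on part (i), the log-likelihood can be maximised by brute force over a grid. This is only an illustrative sketch: the observed counts `x` are hypothetical and the grid resolution is an arbitrary choice.

```python
import math


def poisson_log_likelihood(x, lam):
    """LL for an IID Po(lam) sample x; lgamma(k + 1) equals ln(k!)."""
    n = len(x)
    return -n * lam + sum(x) * math.log(lam) - sum(math.lgamma(xi + 1) for xi in x)


x = [4, 1, 0, 3, 2, 5, 2, 3]                  # hypothetical observed counts
xbar = sum(x) / len(x)                        # sample mean

# Crude numerical maximisation over lambda in (0, 10] with step 0.001.
grid = [k / 1000 for k in range(1, 10001)]
lam_hat = max(grid, key=lambda lam: poisson_log_likelihood(x, lam))

print("grid maximiser:", lam_hat)
print("sample mean:   ", xbar)
```

Because the LL is strictly concave in $\lambda$ whenever $\sum_i x_i > 0$, the grid maximiser is the grid point nearest to the sample mean.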

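(ii) If $X_i \sim \mathrm{U}(0,\theta)$, $\theta > 0$, then
$\widehat{\theta}(\mathbf{x}) = \max x_i$ is the MLE for $\theta$.

(iii) Let $X_i \sim \mathrm{U}(\theta,\theta+1)$ (cf. Example 3.3.1(ii)). To find the
MLE for $\theta$, again look at the likelihood:

\[
f(\mathbf{x};\theta) = I(\max x_i - 1 \le \theta \le \min x_i), \quad \theta \in \mathbb{R}.
\]

We see that if $\max x_i - 1 < \min x_i$ (which is consistent with the assumption
that sample $\mathbf{x}$ is generated by IID $\mathrm{U}(\theta,\theta+1)$ RVs), we can
take any value between $\max x_i - 1$ and $\min x_i$ as the MLE for $\theta$.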

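It is not hard to guess that an unbiased MLE for $\theta$ is the middle point

\[
\widehat{\theta} = \frac{1}{2}\big(\max x_i - 1 + \min x_i\big)
= \frac{1}{2}\big(\max x_i + \min x_i\big) - \frac{1}{2}.
\]

Indeed:

\[
\mathbb{E}\widehat{\theta} = \frac{1}{2}\big(\mathbb{E}\max X_i + \mathbb{E}\min X_i\big) - \frac{1}{2}
= \frac{1}{2}\bigg[\int_\theta^{\theta+1} x\,p_{\min X_i}(x)\,\mathrm{d}x
+ \int_\theta^{\theta+1} x\,p_{\max X_i}(x)\,\mathrm{d}x\bigg] - \frac{1}{2},
\]

with

\[
p_{\min X_i}(x) = -\frac{\mathrm{d}}{\mathrm{d}x}\mathbb{P}(\min X_i > x)
= -\frac{\mathrm{d}}{\mathrm{d}x}\big(\mathbb{P}(X_i > x)\big)^n
= -\frac{\mathrm{d}}{\mathrm{d}x}\bigg(\int_x^{\theta+1}\mathrm{d}y\bigg)^n
= n(\theta+1-x)^{n-1}, \quad \theta < x < \theta+1,
\]

\[
p_{\max X_i}(x) = n(x-\theta)^{n-1}, \quad \theta < x < \theta+1.
\]

Then

\[
\begin{aligned}
\mathbb{E}\min X_i &= n\int_\theta^{\theta+1}(\theta+1-x)^{n-1}x\,\mathrm{d}x \\
&= -n\int_\theta^{\theta+1}(\theta+1-x)^{n}\,\mathrm{d}x
+ (\theta+1)\,n\int_\theta^{\theta+1}(\theta+1-x)^{n-1}\,\mathrm{d}x \\
&= -n\int_0^1 x^n\,\mathrm{d}x + (\theta+1)\,n\int_0^1 x^{n-1}\,\mathrm{d}x
= -\frac{n}{n+1} + \theta + 1 = \theta + \frac{1}{n+1},
\end{aligned}
\tag{3.4.3}
\]

and similarly,

\[
\mathbb{E}\max X_i = \theta + \frac{n}{n+1}, \tag{3.4.4}
\]

giving that $\mathbb{E}\widehat{\theta} = \theta$.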
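
Formulas (3.4.3) and (3.4.4), and the unbiasedness of the middle point, can be checked by simulation. The sketch below is illustrative only; `theta`, `n`, the number of replications and the seed are arbitrary assumptions.

```python
import random
import statistics


def simulate(theta, n, replications, seed=1):
    """Monte Carlo averages of min X_i, max X_i and the midpoint estimator."""
    rng = random.Random(seed)
    mins, maxs, mids = [], [], []
    for _ in range(replications):
        x = [theta + rng.random() for _ in range(n)]    # IID U(theta, theta + 1)
        lo, hi = min(x), max(x)
        mins.append(lo)
        maxs.append(hi)
        mids.append(0.5 * (lo + hi) - 0.5)              # middle-point MLE
    return statistics.fmean(mins), statistics.fmean(maxs), statistics.fmean(mids)


theta, n = 2.0, 5                                       # illustrative values only
mean_min, mean_max, mean_mid = simulate(theta, n, replications=20000)

print(mean_min, "vs", theta + 1 / (n + 1))      # compare with (3.4.3)
print(mean_max, "vs", theta + n / (n + 1))      # compare with (3.4.4)
print(mean_mid, "vs", theta)                    # unbiasedness of the midpoint
```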


The MLEs have a number of handy properties:

(i) If $T$ is a sufficient statistic for $\theta$, then
$\ell(\mathbf{x};\theta) = \ln g(T(\mathbf{x}),\theta) + \ln h(\mathbf{x})$.
Maximising the likelihood in $\theta$ is then reduced to maximising function
$g(T(\mathbf{x}),\theta)$ or its logarithm. That is, the MLE $\widehat{\theta}$ will
be a function of $T(\mathbf{x})$.

(ii) Under mild conditions on the distribution of $X_i$, the MLEs are (or can
be chosen to be) asymptotically unbiased, as $n \to \infty$. Furthermore, an
MLE $\widehat{\theta}$ is often asymptotically normal. In the scalar case, this means
that $\sqrt{n}\,(\widehat{\theta} - \theta) \sim \mathrm{N}(0,v)$, where the variance
$v$ is minimal amongst attainable variances of unbiased estimators for $\theta$.

(iii) The invariance principle for MLEs: If $\widehat{\theta}$ is an MLE for $\theta$
and we pass from parameter $\theta$ to $\eta = u(\theta)$, where function $u$ is
one-to-one, then $u(\widehat{\theta})$ is an MLE for $\eta$.


3.5 Normal samples. The Fisher Theorem 


Did you hear about the statistician who was put in jail? 
He now has zero degrees of freedom. 
(From the series ‘Why they are misunderstood’. ) 


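Example 3.5.1 In this example we consider the MLE for the pair $(\mu,\sigma^2)$
in IID normal samples. Given $X_i \sim \mathrm{N}(\mu,\sigma^2)$, the LL is

\[
\ell(\mathbf{x};\mu,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2)
- \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2, \quad
\mathbf{x} = (x_1,\ldots,x_n)^{\mathrm T} \in \mathbb{R}^n,\ \mu \in \mathbb{R},\ \sigma^2 > 0,
\]

with

\[
\frac{\partial}{\partial\mu}\ell(\mathbf{x};\mu,\sigma^2)
= \frac{1}{\sigma^2}\sum_i (x_i - \mu), \qquad
\frac{\partial}{\partial\sigma^2}\ell(\mathbf{x};\mu,\sigma^2)
= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i (x_i - \mu)^2.
\]

The stationary point where $\partial\ell/\partial\mu = \partial\ell/\partial\sigma^2 = 0$
is unique:

\[
\widehat{\mu} = \overline{x}, \qquad \widehat{\sigma}^2 = \frac{1}{n}S_{xx},
\]

where, as in equation (3.2.9):

\[
S_{xx}\ \big(= S_{xx}(\mathbf{x})\big) = \sum_{i=1}^n (x_i - \overline{x})^2. \tag{3.5.1}
\]

(In some texts, the notation $S^2_{xx}$, or even $S^2_x$, is used instead of
$S_{xx}$.) The point $(\widehat{\mu},\widehat{\sigma}^2) = (\overline{x}, S_{xx}/n)$ is
the global maximiser. This can be seen, for example, because
$\ell(\mathbf{x};\mu,\sigma^2)$ goes to $-\infty$ when $|\mu| \to \infty$ or
$\sigma^2 \to \infty$, and also $\ell(\mathbf{x};\mu,\sigma^2) \to -\infty$ as
$\sigma^2 \to 0$ for every $\mu$. Then $(\overline{x}, S_{xx}/n)$ cannot be
a minimum (or a saddle point). Hence, it is the global maximum. Here,
$\overline{X}$ is unbiased, but $S_{XX}/n$ has a bias:
$\mathbb{E}S_{XX}/n = (n-1)\sigma^2/n < \sigma^2$. See equation (3.2.10) and Worked
Example 1.4.8. However, as $n \to \infty$, the bias disappears:
$\mathbb{E}S_{XX}/n \to \sigma^2$. (The unbiased estimator for $\sigma^2$ is of course
$S_{XX}/(n-1)$.)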


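An important fact is the following statement, often called the Fisher Theorem.

For IID normal samples, the MLE
$(\widehat{\mu},\widehat{\sigma}^2) = (\overline{X}, S_{XX}/n)$ is formed by
independent RVs $\overline{X}$ and $S_{XX}/n$, with

\[
\overline{X} \sim \mathrm{N}\Big(\mu,\frac{\sigma^2}{n}\Big) \quad
\big(\text{i.e. } \sqrt{n}\,(\overline{X} - \mu) \sim \mathrm{N}(0,\sigma^2)\big)
\]

and

\[
\frac{S_{XX}}{\sigma^2} \sim \chi^2_{n-1} \quad
\Big(\text{i.e. } \frac{S_{XX}}{\sigma^2} \sim \sum_{i=1}^{n-1} Y_i^2
\text{ where } Y_i \sim \mathrm{N}(0,1), \text{ independent}\Big).
\]

The Fisher Theorem implies that the RV
$\big[S_{XX} - (n-1)\sigma^2\big]/(\sigma^2\sqrt{2n})$ is asymptotically N(0,1). Then
of course,

\[
\sqrt{n}\,\Big(\frac{1}{n}S_{XX} - \sigma^2\Big) \sim \mathrm{N}(0, 2\sigma^4).
\]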

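To prove the Fisher Theorem, first write

\[
\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n \big(X_i - \overline{X} + \overline{X} - \mu\big)^2
= \sum_{i=1}^n \big(X_i - \overline{X}\big)^2 + n\big(\overline{X} - \mu\big)^2,
\]

since the sum $\sum_{i=1}^n (X_i - \overline{X})(\overline{X} - \mu)
= (\overline{X} - \mu)\sum_{i=1}^n (X_i - \overline{X}) = 0$. In other
words, $\sum_i (X_i - \mu)^2 = S_{XX} + n(\overline{X} - \mu)^2$.

Then use the general fact that if vector

\[
\mathbf{X} - \mu\mathbf{1} = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}
- \mu\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}
\]

has IID entries $X_i - \mu \sim \mathrm{N}(0,\sigma^2)$, then for any real orthogonal
$n \times n$ matrix $A$, vector

\[
\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} = A^{\mathrm T}(\mathbf{X} - \mu\mathbf{1})
\]

has again IID components $Y_i \sim \mathrm{N}(0,\sigma^2)$ (see Worked Example 2.3.5).

We take any orthogonal $A$ with the first column

\[
\begin{pmatrix} 1/\sqrt{n} \\ \vdots \\ 1/\sqrt{n} \end{pmatrix};
\]

to construct such a matrix you simply complete this column to an orthonormal
basis in $\mathbb{R}^n$. For instance, the family $\mathbf{e}_2,\ldots,\mathbf{e}_n$ will
do, where column $\mathbf{e}_k$ has its first $k-1$ entries $1/\sqrt{k(k-1)}$
followed by $-(k-1)/\sqrt{k(k-1)}$ and $n-k$ entries 0:

\[
\mathbf{e}_k = \Big(\tfrac{1}{\sqrt{k(k-1)}},\ \ldots,\ \tfrac{1}{\sqrt{k(k-1)}},\
-\tfrac{k-1}{\sqrt{k(k-1)}},\ 0,\ \ldots,\ 0\Big)^{\mathrm T}.
\]

Then

\[
Y_1 = \big(A^{\mathrm T}(\mathbf{X} - \mu\mathbf{1})\big)_1 = \sqrt{n}\,\big(\overline{X} - \mu\big),
\]

and $Y_2,\ldots,Y_n$ are independent of $Y_1$.

Because the orthogonal matrix preserves the length of a vector, we have
that

\[
\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n (X_i - \mu)^2
= n\big(\overline{X} - \mu\big)^2 + \sum_{i=1}^n \big(X_i - \overline{X}\big)^2
= Y_1^2 + S_{XX},
\]

i.e. $S_{XX} = \sum_{i=2}^n Y_i^2$. Then
$S_{XX}/\sigma^2 = \sum_{i=2}^n Y_i^2/\sigma^2 \sim \chi^2_{n-1}$, independently
of $Y_1$.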
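
The algebra behind this proof can be checked on a concrete vector. The sketch below builds the columns $\mathbf{e}_1,\ldots,\mathbf{e}_n$ described above (the function name `helmert_columns`, the sample `x` and the value `mu` are illustrative assumptions), verifies that they are orthonormal, and confirms that $Y_1 = \sqrt{n}(\overline{x} - \mu)$ and $\sum_{i \ge 2} Y_i^2 = S_{xx}$.

```python
import math


def helmert_columns(n):
    """Columns e_1, ..., e_n of an orthogonal matrix with first column (1/sqrt(n), ...)."""
    cols = [[1 / math.sqrt(n)] * n]
    for k in range(2, n + 1):
        c = 1 / math.sqrt(k * (k - 1))
        # k - 1 entries c, then -(k - 1) * c, then n - k zeros
        cols.append([c] * (k - 1) + [-(k - 1) * c] + [0.0] * (n - k))
    return cols


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x = [2.3, -0.7, 1.1, 4.0, 0.2]        # hypothetical sample
mu = 1.0                              # hypothetical true mean
n = len(x)
cols = helmert_columns(n)

centred = [xi - mu for xi in x]
y = [dot(col, centred) for col in cols]       # Y = A^T (x - mu * 1)

xbar = sum(x) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)

print(y[0], math.sqrt(n) * (xbar - mu))       # Y_1 = sqrt(n) (xbar - mu)
print(sum(v * v for v in y[1:]), s_xx)        # Y_2^2 + ... + Y_n^2 = S_xx
```

Note that the check is purely algebraic: it confirms the identities used in the proof, not the distributional claims.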

Remark 3.5.2 Some authors call the statistic

\[
s_{XX} = \sqrt{\frac{S_{XX}}{n-1}} \tag{3.5.2}
\]

the sample standard deviation. The term the standard error is often used for
$s_{XX}/\sqrt{n}$ which is an estimator of $\sigma/\sqrt{n}$. We will follow this
tradition.


Statisticians do all the standard deviations. 
Statisticians do all the standard errors. 
(From the series ‘How they do it’.) 


3.6 Mean square errors. The Rao–Blackwell Theorem.
The Cramér–Rao inequality


Statistics show that of those who contract 
the habit of eating, very few survive. 
W. Irwin (1876-1959), American editor and writer. 


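When we assess the quality of an estimator $\theta^*$ of a parameter $\theta$, it is
useful to consider the mean square error (MSE) defined as

\[
\mathbb{E}_\theta\big(\theta^*(\mathbf{X}) - \theta\big)^2; \tag{3.6.1}
\]

for an unbiased estimator, this gives the variance
$\mathrm{Var}_\theta\,\theta^*(\mathbf{X})$. In general,

\[
\begin{aligned}
\mathbb{E}_\theta\big(\theta^*(\mathbf{X}) - \theta\big)^2
&= \mathbb{E}_\theta\big(\theta^*(\mathbf{X}) - \mathbb{E}_\theta\theta^*(\mathbf{X})
+ \mathbb{E}_\theta\theta^*(\mathbf{X}) - \theta\big)^2 \\
&= \mathbb{E}_\theta\big(\theta^*(\mathbf{X}) - \mathbb{E}_\theta\theta^*(\mathbf{X})\big)^2
+ \big(\mathbb{E}_\theta\theta^*(\mathbf{X}) - \theta\big)^2 \\
&\quad + 2\big(\mathbb{E}_\theta\theta^*(\mathbf{X}) - \theta\big)\,
\mathbb{E}_\theta\big(\theta^*(\mathbf{X}) - \mathbb{E}_\theta\theta^*(\mathbf{X})\big) \\
&= \mathrm{Var}_\theta\,\theta^*(\mathbf{X}) + \big(\mathrm{Bias}_\theta\,\theta^*(\mathbf{X})\big)^2,
\end{aligned}
\tag{3.6.2}
\]

where $\mathrm{Bias}_\theta\,\theta^* = \mathbb{E}_\theta\theta^* - \theta$.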
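
Decomposition (3.6.2) can be verified exactly for a simple biased estimator. The sketch below assumes a single observation $X \sim \mathrm{Bin}(k,p)$ and the hypothetical estimator $(X+1)/(k+2)$ of $p$; both the model and the numerical values are chosen only for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(j, k, p):
    """P(X = j) for X ~ Bin(k, p), exact when p is a Fraction."""
    return comb(k, j) * p ** j * (1 - p) ** (k - j)


def moments(estimator, k, p):
    """Exact MSE, variance and bias of estimator(X) for X ~ Bin(k, p)."""
    pmf = [binomial_pmf(j, k, p) for j in range(k + 1)]
    mean = sum(w * estimator(j) for j, w in enumerate(pmf))
    mse = sum(w * (estimator(j) - p) ** 2 for j, w in enumerate(pmf))
    var = sum(w * (estimator(j) - mean) ** 2 for j, w in enumerate(pmf))
    return mse, var, mean - p


k, p = 6, Fraction(1, 4)                       # illustrative values only


def shrunk(j):
    """A deliberately biased estimator of p (hypothetical example)."""
    return Fraction(j + 1, k + 2)


mse, var, bias = moments(shrunk, k, p)
print("MSE          :", mse)
print("Var + Bias^2 :", var + bias ** 2)       # identical to the MSE, by (3.6.2)
```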

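In general, there is a simple way to decrease the MSE of a given estimator.
It is to use the Rao–Blackwell (RB) Theorem:

If $T$ is a sufficient statistic and $\theta^*$ an estimator for $\theta$ then
$\widehat{\theta}^* = \mathbb{E}(\theta^*|T)$ has

\[
\mathbb{E}\big(\widehat{\theta}^* - \theta\big)^2 \le \mathbb{E}\big(\theta^* - \theta\big)^2,
\quad \theta \in \Theta. \tag{3.6.3}
\]

Moreover, if $\mathbb{E}(\theta^*)^2 < \infty$ for some $\theta \in \Theta$, then, for
this $\theta$, the inequality is strict unless $\theta^*$ is a function of $T$.

The proof is short: as
$\mathbb{E}\widehat{\theta}^* = \mathbb{E}\big[\mathbb{E}(\theta^*|T)\big] = \mathbb{E}\theta^*$,
both $\theta^*$ and $\widehat{\theta}^*$ have the same bias. By the conditional
variance formula

\[
\mathrm{Var}\,\theta^* = \mathbb{E}\big[\mathrm{Var}(\theta^*|T)\big]
+ \mathrm{Var}\big[\mathbb{E}(\theta^*|T)\big]
= \mathbb{E}\big[\mathrm{Var}(\theta^*|T)\big] + \mathrm{Var}\,\widehat{\theta}^*.
\]

Hence, $\mathrm{Var}\,\theta^* \ge \mathrm{Var}\,\widehat{\theta}^*$ and so
$\mathbb{E}(\theta^* - \theta)^2 \ge \mathbb{E}(\widehat{\theta}^* - \theta)^2$. The
equality is attained if and only if $\mathrm{Var}(\theta^*|T) = 0$.


Remark 3.6.1 (1) The quantity $\mathbb{E}(\theta^*|T)$ depends on a value of
$\theta^*(\mathbf{x})$ but not on $\theta$. Thus $\widehat{\theta}^*$ is correctly defined.

(2) If $\theta^*$ is unbiased, then so is $\widehat{\theta}^*$.

(3) If $\theta^*$ is itself a function of $T$, then $\widehat{\theta}^* = \theta^*$.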


The RB Theorem bears the names of two distinguished academics. David 
Harold Blackwell (1919-2010) was a US mathematician, a leading proponent 
of a game theoretical approach in statistics and other disciplines. Blackwell 
was one of the first African-American mathematicians to be employed by a 
leading university in the USA and the first one to be elected into the National 
Academy of Sciences of the USA. Blackwell was well-known for his ability 
to find new insights in diverse areas of mathematics. He was a dedicated 
lecturer and teacher at all levels of mathematics (65 Ph.D. students in total), 
with a subtle sense of humour. (A legend says that he did most of his teaching 
by talking about balls and boxes.) His introductory lecture filmed by the 
American Mathematical Society was called “Guessing at Random”. Being a 
prominent black member of the academic community, Blackwell was a role
model and an ambassador of values of education, travelling across many
underdeveloped countries. He was also an avid follower of field and track 
events in athletics and a regular spectator of tournaments around America. 
During their long-lasting marriage, David and his wife Ann Madison raised 
eight children. 

Calyampudi Radhakrishna Rao (b. 1920) is an Indian mathematician who
studied in India and Britain (he took his Ph.D. at Cambridge University 
and was Fisher’s only formal Ph.D. student in statistics), worked for a long 
time in India and currently lives and works in the USA. Later in this book 
we will talk about him in more detail. 


Example 3.6.2 (i) Let X; ~ U(0,0). Then 6* = 2X = 25>, X;/n is an 
unbiased estimator for 0, with 
=, 4 0? 
Var(2X) = — Var X; = —. 
ar(2X) |, Var X1 = 3 
We know that the sufficient statistic T has the form T(X) = max; X;, with 
the PDF 


n-1 
—_I(0<# <6). 


fr(x)=n F 


Hence, 


Ox * 2 7 
=e |) = - > H (X;|T) = 2E(X,|T) 
a 
1 X; -—1 1 
=2 (max x x Mis age )-4 T 
n 2 n n 


and @* should have an MSE less than or equal to that of 0*. Surprisingly, 
giving away a lot of information about the sample leads to an improved 
MSE! In fact, the variance Var 6* = (n + 1)?(VarT')/n?, where 


6 n-1 0 n—-1 2 
Var T = i n—— ade — is n——xdz 

0 gn 0 gr 

2 
n n 

n+ 2 (= :) 
So Var 6* = 6? /[n(n + 2)] which is < 6?/3n for n > 2 and goes faster to 0 
as 2 —> OO. 


(ii) Let X; ~ U(0,0+1). Then 0 = S>; Xi/n — § is an unbiased estimator 
for 0; here 


2 nr 
(n+ 1)?(n +2)" 


Q 


=. A 1 
Var X = — Var X; = ——. 
n 12n 
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This form of the estimator emerges when we equate the value 6+] /2 with 
sample mean )~;"_, X;/n. Such a trick forms the basis of the so-called method 
of moments in statistics. The method of moments was popular in the past 
but presently has been superseded by the method of maximum likelihood. 

We know that the sufficient statistic T has the form (min; X;, max; X;). 
Hence, 


x 1 1 
OO) ei E(X;|1) — = 
Or) = TOBIN) ~ 5 
Mahe 

= E(X,|T) - a 3 (min Xi + max X; — 1). 


The estimator 6* is unbiased: 


abt = > ( LAE Oy, Sg 1) =6 
2\n+1 n+l 


and it should have an MSE less than or equal to that of 6*. Again, giving 
away excessive information about X leads to a lesser MSE. In fact, the 


~ 1 
variance Var 6* = Z Var(min X; + max X;) equals 


: 1 
a (min X; + max X;)* — a\ Emin X; + Emax X;)° 
641041 
: 1 
a 4 / / Pity Ree ES + y)*dyda os 7 (28 ae ier 
6 


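See equations (3.4.3) and (3.4.4). Here,

$$f_{\min X_i,\max X_i}(x,y) = -\frac{\partial^2}{\partial x\,\partial y}(y-x)^n = n(n-1)(y-x)^{n-2},\quad \theta < x < y < \theta+1.$$

Writing

$$I = \int_\theta^{\theta+1}\!\!\int_x^{\theta+1} (y-x)^{n-2}(x+y)^2\,dy\,dx,$$

we have

$$\frac{1}{4}\,n(n-1)\,I = \theta^2 + \theta + \frac{n^2+3n+4}{4(n+1)(n+2)},$$

which gives

$$\mathrm{Var}\,\hat\theta^* = \frac{n^2+3n+4}{4(n+1)(n+2)} - \frac{1}{4} = \frac{1}{2(n+1)(n+2)} \le \frac{1}{12n} = \mathrm{Var}\,\overline{X},$$

with strict inequality for $n \ge 3$.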


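Indeed, the above integral $I$ equals

$$\int_\theta^{\theta+1}\!\!\int_x^{\theta+1} (y-x)^{n-2}\left[(y-x)^2 + 4x(y-x) + 4x^2\right]dy\,dx$$

$$= \int_\theta^{\theta+1}\left[\frac{(\theta+1-x)^{n+1}}{n+1} + \frac{4x(\theta+1-x)^{n}}{n} + \frac{4x^2(\theta+1-x)^{n-1}}{n-1}\right]dx$$

$$= \frac{1}{(n-1)n(n+1)(n+2)}\Big[n(n-1) + 4\theta(n-1)(n+2) + 4(n-1) + 4\theta^2(n+1)(n+2) + 8\theta(n+2) + 8\Big].$$

Hence,

$$\frac{1}{4}\,n(n-1)\,I = \theta^2 + \theta + \frac{n^2+3n+4}{4(n+1)(n+2)},$$

as claimed.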


Example 3.6.3 Suppose that $X_1,\dots,X_n$ are independent RVs uniformly
distributed over $(\theta, 2\theta)$. Find a two-dimensional sufficient statistic $T(\mathbf{X})$ for
$\theta$. Show that an unbiased estimator of $\theta$ is $\tilde\theta = 2X_1/3$.

Find an unbiased estimator of $\theta$ which is a function of $T(\mathbf{X})$ and whose
mean square error is no more than that of $\tilde\theta$.

Here, the likelihood function is

$$f(\mathbf{x};\theta) = \prod_{i=1}^n \frac{1}{\theta}\,I(\theta < x_i < 2\theta) = \frac{1}{\theta^n}\,I\left(\min_i x_i > \theta,\ \max_i x_i < 2\theta\right),$$

and hence, by the factorisation criterion,

$$T = \left(\min_i X_i,\ \max_i X_i\right)$$

is sufficient. Clearly, $EX_1 = 3\theta/2$, so $\tilde\theta = 2X_1/3$ is an unbiased estimator.
Define

$$\hat\theta^* = E\left(\frac{2}{3}X_1\,\Big|\,\min_i X_i = a,\ \max_i X_i = b\right)
= \frac{2a}{3n} + \frac{2b}{3n} + \frac{n-2}{n}\cdot\frac{2}{3}\cdot\frac{a+b}{2} = \frac{a+b}{3}.$$

In fact, $X_1$ equals $a$ or $b$ with probability $1/n$ each; otherwise (when $X_1 \ne a, b$,
which holds with probability $(n-2)/n$) the conditional expectation of $X_1$
equals $(a+b)/2$ because of the symmetry.

Consequently, by the RB Theorem,

$$\hat\theta^* = \frac{1}{3}\left(\min_i X_i + \max_i X_i\right)$$

is the required estimator.
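
A small simulation makes the improvement in Example 3.6.3 visible. This is a rough sketch rather than part of the solution: the sample size, the value $\theta = 2$, the number of trials and the seed are arbitrary choices of ours.

```python
import random


def simulate(n=5, theta=2.0, trials=20000, seed=0):
    """Monte Carlo MSE of 2 X_1 / 3 and of (min + max) / 3 for U(theta, 2 theta)."""
    rng = random.Random(seed)
    naive_sq = rb_sq = rb_sum = 0.0
    for _ in range(trials):
        xs = [rng.uniform(theta, 2 * theta) for _ in range(n)]
        naive = 2 * xs[0] / 3              # unbiased estimator using X_1 only
        rb = (min(xs) + max(xs)) / 3       # its Rao-Blackwellised version
        naive_sq += (naive - theta) ** 2
        rb_sq += (rb - theta) ** 2
        rb_sum += rb
    return naive_sq / trials, rb_sq / trials, rb_sum / trials


mse_naive, mse_rb, mean_rb = simulate()
print("MSE of 2*X_1/3:     ", mse_naive)
print("MSE of (min+max)/3: ", mse_rb)
print("mean of (min+max)/3:", mean_rb)
```

The empirical mean of the second estimator should stay close to $\theta$, while its empirical MSE should be several times smaller than that of $2X_1/3$.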


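We would like of course to have an estimator with a minimal MSE (a
minimum MSE estimator). An effective tool to find such estimators is given
by the Cramér–Rao (CR) inequality, or CR bound.

Assume that a PDF/PMF $f(\,\cdot\,;\theta)$ depends smoothly on parameter $\theta$ and
the following condition holds: for all $\theta \in \Theta$

$$\int \frac{\partial}{\partial\theta} f(x;\theta)\,dx = 0 \quad\text{or}\quad \sum_{x\in\mathbb{V}} \frac{\partial}{\partial\theta} f(x;\theta) = 0. \qquad (3.6.4)$$

Consider IID observations $X_1,\dots,X_n$, with joint PDF/PMF $f(\mathbf{x};\theta) =
f(x_1;\theta)\cdots f(x_n;\theta)$. Take an unbiased estimator $\theta^*(\mathbf{X})$ of $\theta$ satisfying the
condition: for all $\theta \in \Theta$

$$\int_{\mathbb{R}^n} \theta^*(\mathbf{x})\,\frac{\partial}{\partial\theta} f(\mathbf{x};\theta)\,d\mathbf{x} = 1 \quad\text{or}\quad \sum_{\mathbf{x}\in\mathbb{V}^n} \theta^*(\mathbf{x})\,\frac{\partial}{\partial\theta} f(\mathbf{x};\theta) = 1. \qquad (3.6.5)$$

Then for any such estimator, the following bound holds:

$$\mathrm{Var}\,\theta^* \ge \frac{1}{nA(\theta)}, \qquad (3.6.6)$$

where

$$A(\theta) = \int \frac{\left(\partial f(x;\theta)/\partial\theta\right)^2}{f(x;\theta)}\,dx \quad\text{or}\quad \sum_{x\in\mathbb{V}} \frac{\left(\partial f(x;\theta)/\partial\theta\right)^2}{f(x;\theta)}. \qquad (3.6.7)$$

The quantity $A(\theta)$ is often called the Fisher information and features in
many areas of probability theory and statistics.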
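
As a numerical illustration of the Fisher information, the sum in (3.6.7) can be evaluated directly for a Poisson PMF $f(k;\lambda) = e^{-\lambda}\lambda^k/k!$. The sketch below is ours: the function name and the truncation point `kmax` are illustrative assumptions, and the summand is written as $f\cdot(\partial \ln f/\partial\lambda)^2$, which equals $(\partial f/\partial\lambda)^2/f$ wherever $f > 0$.

```python
import math


def poisson_fisher_information(lam, kmax=200):
    """Truncated version of the sum (3.6.7) for the Poisson PMF f(k; lam)."""
    total = 0.0
    pmf = math.exp(-lam)                 # f(0; lam)
    for k in range(kmax + 1):
        score = k / lam - 1.0            # d/d(lam) of ln f(k; lam)
        total += pmf * score ** 2        # equals (df/d lam)^2 / f where f > 0
        pmf *= lam / (k + 1)             # f(k + 1; lam) obtained from f(k; lam)
    return total


lam, n = 2.5, 10
info = poisson_fisher_information(lam)
print("A(lambda) =", info, " versus 1/lambda =", 1 / lam)
print("CR bound 1/(n A) =", 1 / (n * info), " versus lambda/n =", lam / n)
```

The computed value agrees with $A(\lambda) = 1/\lambda$, so bound (3.6.6) reads $\mathrm{Var}\,\theta^* \ge \lambda/n$, which is exactly the variance of the sample mean.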


Remark 3.6.4 In practice, condition (3.6.4) means that we can interchange
the differentiation $\partial/\partial\theta$ and the integration/summation in the (trivial)
equality

$$\frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = 0 \quad\text{or}\quad \frac{\partial}{\partial\theta}\sum_{x\in\mathbb{V}} f(x;\theta) = 0.$$

The equality holds as

$$\int f(x;\theta)\,dx = 1 \quad\text{or}\quad \sum_{x\in\mathbb{V}} f(x;\theta) = 1.$$


A sufficient condition for such an interchange is that 


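which is often assumed by default. Similarly, equation (3.6.5) means that we
can interchange the differentiation $\partial/\partial\theta$ and the integration/summation in the
equality

$$\frac{\partial}{\partial\theta}\,E\theta^*(\mathbf{X}) = \frac{\partial}{\partial\theta}\,\theta = 1,$$

as

$$E\theta^*(\mathbf{X}) = \int_{\mathbb{R}^n} \theta^*(\mathbf{x}) f(\mathbf{x};\theta)\,d\mathbf{x} \quad\text{or}\quad \sum_{\mathbf{x}\in\mathbb{V}^n} \theta^*(\mathbf{x}) f(\mathbf{x};\theta).$$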


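Observe that the derivative $\partial f(\mathbf{x};\theta)/\partial\theta$ can be written as

$$\frac{\partial}{\partial\theta} f(\mathbf{x};\theta) = f(\mathbf{x};\theta)\sum_{i=1}^n \frac{\partial}{\partial\theta}\ln f(x_i;\theta). \qquad (3.6.8)$$

We will prove bound (3.6.6) in the continuous case only (the proof in the
discrete case simply requires the integrals to be replaced by sums). Set

$$D(x,\theta) = \frac{\partial}{\partial\theta}\ln f(x;\theta).$$

By condition (3.6.4),

$$\int f(x;\theta)\,D(x,\theta)\,dx = 0,$$

and we have that for all $\theta \in \Theta$

$$\int_{\mathbb{R}^n} f(\mathbf{x};\theta)\sum_{i=1}^n D(x_i,\theta)\,d\mathbf{x} = n\int \frac{\partial}{\partial\theta} f(x;\theta)\,dx = 0.$$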


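On the other hand, by virtue of (3.6.8) equation (3.6.5) has the form

$$\int_{\mathbb{R}^n} \theta^*(\mathbf{x})\,f(\mathbf{x};\theta)\sum_{i=1}^n D(x_i,\theta)\,d\mathbf{x} = 1,\quad \theta\in\Theta.$$

The two last equations imply that

$$\int_{\mathbb{R}^n} \left(\theta^*(\mathbf{x}) - \theta\right)\sum_{i=1}^n D(x_i,\theta)\,f(\mathbf{x};\theta)\,d\mathbf{x} = 1,\quad \theta\in\Theta.$$

In what follows we omit subscript $\mathbb{R}^n$ in $n$-dimensional integrals in $d\mathbf{x}$
(integrals in $dx$ are of course one-dimensional). We interpret the last equality
as $\langle g_1, g_2\rangle = 1$, where

$$\langle g_1, g_2\rangle = \int g_1(\mathbf{x})\,g_2(\mathbf{x})\,\eta(\mathbf{x})\,d\mathbf{x}$$

is a scalar product, with

$$g_1(\mathbf{x}) = \theta^*(\mathbf{x}) - \theta,\quad g_2(\mathbf{x}) = \sum_{i=1}^n D(x_i,\theta)\quad\text{and}\quad \eta(\mathbf{x}) = f(\mathbf{x};\theta) \ge 0$$

($\eta(\mathbf{x})$ is the weight function determining the scalar product).
Now we use the Cauchy-Schwarz (CS) inequality $\langle g_1,g_2\rangle^2 \le
\langle g_1,g_1\rangle\,\langle g_2,g_2\rangle$. We obtain thus

$$1 \le \left[\int \left(\theta^*(\mathbf{x}) - \theta\right)^2 f(\mathbf{x};\theta)\,d\mathbf{x}\right]\left[\int \Big(\sum_{i=1}^n D(x_i,\theta)\Big)^2 f(\mathbf{x};\theta)\,d\mathbf{x}\right].$$

Here, the first factor gives $\mathrm{Var}\,\theta^*(\mathbf{X})$. The second factor will give precisely
$nA(\theta)$. In fact:

$$\int \Big(\sum_{i=1}^n D(x_i,\theta)\Big)^2 f(\mathbf{x};\theta)\,d\mathbf{x} = \sum_{i=1}^n \int D(x_i,\theta)^2 f(\mathbf{x};\theta)\,d\mathbf{x}
+ 2\sum_{1\le i_1 < i_2\le n} \int D(x_{i_1},\theta)\,D(x_{i_2},\theta)\,f(\mathbf{x};\theta)\,d\mathbf{x}.$$

Each term in the first sum gives $A(\theta)$:

$$\int D(x_i,\theta)^2 f(\mathbf{x};\theta)\,d\mathbf{x} = \int D(x,\theta)^2 f(x;\theta)\,dx = A(\theta),$$

while each term in the second sum gives zero, as $\int D(x,\theta) f(x;\theta)\,dx = 0$:

$$\int D(x_{i_1},\theta)\,D(x_{i_2},\theta)\,f(\mathbf{x};\theta)\,d\mathbf{x} = \left[\int D(x,\theta) f(x;\theta)\,dx\right]^2 = 0.$$

This completes the proof.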
To conclude this section, we give a short account of Rao’s stay in Cambridge.
Rao arrived in Cambridge in 1945 to work in the University Museum
of Archaeology and Anthropology on analysing objects (human skulls
and bones) brought back by a British expedition from a thousand-year-old
site in North Africa. He had to measure them carefully and apply what is 
called the Mahalanobis distance to derive conclusions about the anthropo- 
logical characteristics of an ancient population. Rao was then 25 years old 
and had 18 papers in statistics published or accepted for publication. His 
first paper, which had just been published, contained both the RB Theorem 
and CR inequality. Soon after his arrival in Cambridge he met Fisher and 
began also working in Fisher’s Laboratory of Genetics on various character- 
istics of mice; he had to breed them, mate them in a determined way and 
record genetic parameters of the litter produced (kinky tails, ruffled hair 
and a disposition to keep shaking all the time). As is described in Rao’s 
biography [86], his final duty in each experiment was to dispose of mice not 
needed in further work; according to the customs of the time, young mice 
were put in ether and mature ones had their heads hit against a table (a 
practice that would nowadays undoubtedly generate an objection from An- 
imal Rights activists). Rao was too sensitive to do this particular job and 
had some friends who did it for him; otherwise he utterly enjoyed his work 
(and the rest of his time) in Cambridge. Another of his friends was Abdus 
Salam, the future Nobel Prize winner in physics and then a student at St 
John’s College. Salam had doubts about his future in research and, seeing 
Rao’s determination, expressed a keen interest in statistics. However, it was
Rao who persuaded him not to change his field .... 

The work in Fisher’s Laboratory formed the basis of Rao’s Ph.D. the- 
sis which he passed successfully in 1948, by which time he had 40 pub- 
lished papers. After receiving his Ph.D. from such a prestigious university 
as Cambridge, it was supposed that back in India he would receive offers 
of matrimonial alliances from many rich families. But a month after return- 
ing home from Cambridge he became engaged to his future wife, who was 
then 23 and had her own academic degrees. The marriage was arranged by 
Rao’s mother who was very progressive: she did not mind him marrying a 
highly educated woman, although at that time such brides were generally 
not wanted in families with eligible sons. Their marriage has been perfectly 
happy, which makes one wonder why the (completely unarranged) marriages 
of other famous statisticians in Europe and America (including ones repeat- 
edly mentioned in this book) ended badly. 

Rao’s contributions in statistics are now widely recognised. It was not so in 
the beginning, particularly with the RB Theorem. Even in 1953, eight years 
after Rao’s paper and five years after Blackwell’s, some statisticians were 
referring to the procedure described in the theorem as ‘Blackwellisation’. 
When Rao pointed out that he was the first to discover the result, a lecturer
on this topic said that this term is easier on the tongue than ‘Raoisation’.
However, in a later paper the statistician in question proposed the term
‘Rao–Blackwellisation’ that is now used. On the issue of the CR inequality
(the term proposed by Neyman), Rao remembers a call from an airline 
employee at Tehran airport: ‘Good news, Mr Cramer Rao, we found your 
bag’, after a piece of his luggage was lost on a flight. 

Rao likes using humour in serious situations. In India, as in many coun- 
tries, birth control is an important issue, and providing women with reasons 
not to have too many children is one of the perennial tasks of local and 
central administration. In one of his articles on this topic, Rao points out 
that every fourth baby born in the world is Chinese and then makes the fol- 
lowing statement to his Indian audience: ‘Look before you leap to your next 
baby, if you already have three. The fourth will be a Chinese!’ Hopefully, 
some readers of this book will find this instructive in dealing with sample 
means .... 


3.7 Exponential families 
Sex, Lies and Exponential Families. 


(From the series ‘Movies that never made it to the Big Screen’.) 


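It is interesting to investigate when the equality in CR bound (3.6.6) is
attained. Here again, the CS inequality is crucial. We know that for equality
we need functions $g_1$ and $g_2$ to be linearly dependent: $g_1(\mathbf{x}) = \lambda g_2(\mathbf{x})$. Then

$$\int g_1(\mathbf{x})\,g_2(\mathbf{x})\,\eta(\mathbf{x})\,d\mathbf{x} = \lambda\int g_2(\mathbf{x})^2\,\eta(\mathbf{x})\,d\mathbf{x}.$$

In our case, $\int g_1(\mathbf{x})\,g_2(\mathbf{x})\,\eta(\mathbf{x})\,d\mathbf{x} = 1$ and so

$$\lambda = \left[\int g_2(\mathbf{x})^2\,\eta(\mathbf{x})\,d\mathbf{x}\right]^{-1} = \frac{1}{nA(\theta)}.$$

Thus, we obtain the relation:

$$\theta^*(\mathbf{x}) - \theta = \frac{1}{nA(\theta)}\sum_{i=1}^n D(x_i,\theta),$$

or

$$\theta^*(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n\left[\theta + \frac{D(x_i,\theta)}{A(\theta)}\right].$$

The LHS of the last equation does not depend on $\theta$. Hence, each term
$\theta + D(x_i,\theta)/A(\theta)$ should be independent of $\theta$:

$$\theta + \frac{D(x,\theta)}{A(\theta)} = C(x).$$

In other words,

$$D(x,\theta) = A(\theta)\left(C(x) - \theta\right),\quad \theta\in\Theta, \qquad (3.7.1)$$

and the estimator $\theta^*$ has the summatory form:

$$\theta^*(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n C(x_i). \qquad (3.7.2)$$

Now solving equation (3.7.1):

$$\frac{\partial}{\partial\theta}\ln f(x;\theta) = A(\theta)\left[C(x) - \theta\right],$$

where $A(\theta) = B''(\theta)$, we obtain

$$\ln f(x;\theta) = B'(\theta)\left[C(x) - \theta\right] + B(\theta) + H(x).$$

Hence,

$$f(x;\theta) = \exp\left\{B'(\theta)\left[C(x) - \theta\right] + B(\theta) + H(x)\right\}. \qquad (3.7.3)$$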

Such families (of PDFs or PMFs) are called exponential. 

Therefore, the following statement holds. 

Equality in the CR inequality is attained if and only if the family $\{f(x;\theta)\}$
is exponential, i.e. is given by (3.7.3), where $B''(\theta) > 0$. In this case the
minimum MSE estimator of $\theta$ is given by (3.7.2) and its variance equals
$1/(nB''(\theta))$. Thus the Fisher information equals $B''(\theta)$.


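Example 3.7.1 Let $X_i$ be IID N$(\mu,\sigma^2)$ with a given $\sigma^2$ and unknown $\mu$.
Write $c = -[\ln(2\pi\sigma^2)]/2$ (which is a constant as $\sigma^2$ is fixed). Then

$$\ln f(x;\mu) = -\frac{(x-\mu)^2}{2\sigma^2} + c = \frac{\mu(x-\mu)}{\sigma^2} + \frac{\mu^2}{2\sigma^2} - \frac{x^2}{2\sigma^2} + c.$$

We see that the family of PDFs $f(\,\cdot\,;\mu)$ is exponential; in this case:

$$C(x) = x,\quad B(\mu) = \frac{\mu^2}{2\sigma^2},\quad B'(\mu) = \frac{\mu}{\sigma^2},\quad
A(\mu) = B''(\mu) = \frac{1}{\sigma^2},\quad H(x) = -\frac{x^2}{2\sigma^2} + c.$$

Note that here $A(\mu)$ does not depend on $\mu$.
Hence, in the class of unbiased estimators $\theta^*(\mathbf{x})$ such that

$$\int_{\mathbb{R}^n} \theta^*(\mathbf{x})\sum_{i=1}^n \frac{x_i-\mu}{\sigma^2}\,\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\sum_{j=1}^n \frac{(x_j-\mu)^2}{2\sigma^2}\right]d\mathbf{x} = 1,$$

the minimum MSE estimator for $\mu$ is

$$\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i,$$

the sample mean, with $\mathrm{Var}\,\overline{X} = \sigma^2/n$.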


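Example 3.7.2 If we now assume that $X_i$ is IID N$(\mu,\sigma^2)$ with a given $\mu$
and unknown $\sigma^2$, then again the family of PDFs $f(\,\cdot\,;\sigma^2)$ is exponential:

$$\ln f(x;\sigma^2) = -\frac{1}{2\sigma^2}\left[(x-\mu)^2 - \sigma^2\right] - \frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2},$$

with

$$C(x) = (x-\mu)^2,\quad B(\sigma^2) = -\frac{1}{2}\ln(\sigma^2),\quad B'(\sigma^2) = -\frac{1}{2\sigma^2},\quad
A(\sigma^2) = B''(\sigma^2) = \frac{1}{2\sigma^4},$$

and $H(x) = -[\ln(2\pi) + 1]/2$. We conclude that, in the class of unbiased
estimators $\theta^*(\mathbf{x})$ such that

$$\int_{\mathbb{R}^n} \theta^*(\mathbf{x})\sum_{i=1}^n \frac{(x_i-\mu)^2 - \sigma^2}{2\sigma^4}\,\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\sum_{j=1}^n \frac{(x_j-\mu)^2}{2\sigma^2}\right]d\mathbf{x} = 1,$$

the minimum MSE estimator for $\sigma^2$ is $\Sigma^2/n$, where

$$\Sigma^2 = \sum_{i=1}^n (X_i - \mu)^2,$$

with $\mathrm{Var}(\Sigma^2/n) = 2\sigma^4/n$.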


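Example 3.7.3 The Poisson family, with $f_\lambda(k) = e^{-\lambda}\lambda^k/k!$, is also exponential:

$$\ln f_\lambda(k) = k\ln\lambda - \ln(k!) - \lambda = (k-\lambda)\ln\lambda + \lambda(\ln\lambda - 1) - \ln(k!),$$

with

$$C(k) = k,\quad B(\lambda) = \lambda(\ln\lambda - 1),\quad B'(\lambda) = \ln\lambda,\quad A(\lambda) = B''(\lambda) = \frac{1}{\lambda},$$

and $H(k) = -\ln(k!)$. Thus the minimum MSE estimator is

$$\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i,\quad\text{with}\quad \mathrm{Var}\,\overline{X} = \frac{\lambda}{n}.$$

We leave it to the reader to determine in what class the minimum MSE is
guaranteed.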


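Example 3.7.4 The family where RVs $X_i \sim \mathrm{Exp}(\lambda)$ is exponential, but
relative to $\theta = 1/\lambda$ (which is not surprising as $EX_i = 1/\lambda$). In fact, here

$$f(x;\theta) = \exp\left[-\frac{1}{\theta}(x-\theta) - \ln\theta - 1\right] I(x > 0),$$

with

$$C(x) = x,\quad B(\theta) = -\ln\theta,\quad B'(\theta) = -\frac{1}{\theta},\quad A(\theta) = B''(\theta) = \frac{1}{\theta^2},\quad H(x) = -1.$$

Therefore, in the class of unbiased estimators $\theta^*(\mathbf{x})$ of $1/\lambda$, with

$$\int_{\mathbb{R}_+^n} \theta^*(\mathbf{x})\sum_{i=1}^n\left(\lambda^2 x_i - \lambda\right)\lambda^n\exp\left(-\lambda\sum_{j=1}^n x_j\right)d\mathbf{x} = 1,$$

the sample mean $\overline{x} = \sum_i x_i/n$ is the minimum MSE estimator, and it has

$$\mathrm{Var}\,\overline{X} = \theta^2/n = 1/(n\lambda^2).$$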


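An important property is that for exponential families the minimum MSE
estimator $\theta^*(\mathbf{x})$ coincides with the MLE $\hat\theta(\mathbf{x})$, i.e.

$$\hat\theta(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n C(x_i). \qquad (3.7.4)$$

More precisely, we write the stationarity equation as

$$\frac{\partial}{\partial\theta} f(\mathbf{x};\theta) = \sum_{i=1}^n D(x_i,\theta)\,f(\mathbf{x};\theta) = 0,$$

which, under the condition that $f(\mathbf{x};\theta) > 0$, becomes

$$\sum_{i=1}^n D(x_i,\theta) = 0. \qquad (3.7.5)$$

In the case of an exponential family, with

$$f(x;\theta) = \exp\left(B'(\theta)\left[C(x) - \theta\right] + B(\theta) + H(x)\right),$$

we have

$$D(x,\theta) = \frac{\partial}{\partial\theta}\ln f(x;\theta) = B''(\theta)\left(C(x) - \theta\right).$$

We see that if $\theta^* = \theta^*(\mathbf{x}) = \sum_{i=1}^n C(x_i)/n$, then

$$D(x_i,\theta^*) = B''(\theta^*)\left[C(x_i) - \theta^*\right] = B''(\theta^*)\left[C(x_i) - \frac{1}{n}\sum_{j=1}^n C(x_j)\right],$$

and hence

$$\sum_{i=1}^n D(x_i,\theta^*) = B''(\theta^*)\sum_{i=1}^n\left[C(x_i) - \frac{1}{n}\sum_{j=1}^n C(x_j)\right] = 0.$$

Thus minimum MSE estimator $\theta^*$ solves the stationarity equation. Therefore,
if an exponential family $\{f(x;\theta)\}$ is such that any solution to stationarity
equation (3.4.2) gives a global likelihood maximum, then $\theta^*$ is the MLE.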

The CR inequality is named after C.H. Cramér (1893-1985), a promi- 
nent Swedish analyst, number theorist, probabilist and statistician, and 
C.R. Rao. One story is that the final form of the inequality was proved 
by Rao, then a young (and inexperienced) lecturer at the Indian Statistical 
Institute, overnight in 1943 in response to a student enquiry about some 
unclear places in his presentation. 


3.8 Confidence intervals 


Statisticians do it with 95 percent confidence. 
(From the series ‘How they do it’.) 


So far we developed ideas related to point estimation. Another useful idea
is to consider interval estimation. Here, one works with confidence intervals
(CIs) (in the case of a vector parameter $\theta \in \mathbb{R}^d$, confidence sets like squares,
cubes, circles or ellipses, balls etc.). Given $\gamma \in (0,1)$, a $100\gamma$ percent CI for
a scalar parameter $\theta \in \Theta \subseteq \mathbb{R}$ is any pair of functions $a(\mathbf{x})$ and $b(\mathbf{x})$ such
that for all $\theta \in \Theta$ the probability

$$P_\theta\left(a(\mathbf{X}) \le \theta \le b(\mathbf{X})\right) = \gamma. \qquad (3.8.1)$$

We want to stress that: (i) the randomness here is related to endpoints
$a(\mathbf{x}) < b(\mathbf{x})$, not to $\theta$, and (ii) $a$ and $b$ should not depend on $\theta$. A CI may be
chosen in different ways; naturally, one is interested in ‘shortest’ intervals.

Example 3.8.1 The first standard example is a CI for the unknown mean
of a normal distribution with a known variance. Assume that $X_i \sim \mathrm{N}(\mu,\sigma^2)$,
where $\sigma^2$ is known and $\mu \in \mathbb{R}$ has to be estimated. We want to find a
99 percent CI for $\mu$. We know that $\sqrt{n}(\overline{X} - \mu)/\sigma \sim \mathrm{N}(0,1)$. Therefore, if
we take $z_- < z_+$ such that $\Phi(z_+) - \Phi(z_-) = 0.99$, then the equation

$$P\left(z_- < \frac{\sqrt{n}}{\sigma}(\overline{X} - \mu) < z_+\right) = 0.99$$

can be rewritten as

$$P\left(\overline{X} - \frac{z_+\sigma}{\sqrt{n}} < \mu < \overline{X} - \frac{z_-\sigma}{\sqrt{n}}\right) = 0.99,$$

i.e. gives the interval

$$\left(\overline{X} - \frac{z_+\sigma}{\sqrt{n}},\ \overline{X} - \frac{z_-\sigma}{\sqrt{n}}\right),$$

centred at $\overline{X} - (z_- + z_+)\sigma/(2\sqrt{n})$ and of width $(z_+ - z_-)\sigma/\sqrt{n}$.

We still have a choice of $z_-$ and $z_+$; to obtain the shortest interval we
would like to choose $z_+ = -z_- := z$, as the N(0,1) PDF is symmetric and
has its peak at the origin. Then the interval becomes

$$\left(\overline{X} - \frac{z\sigma}{\sqrt{n}},\ \overline{X} + \frac{z\sigma}{\sqrt{n}}\right),$$

and $z$ will be the upper 0.005 point of the standard normal distribution, with
$\Phi(z) = 1 - 0.005 = 0.995$. From the normal percentage tables: $z = 2.5758$.
Hence, the answer:

$$\left(\overline{X} - \frac{2.5758\,\sigma}{\sqrt{n}},\ \overline{X} + \frac{2.5758\,\sigma}{\sqrt{n}}\right).$$

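
The calculation of Example 3.8.1 can be sketched in a few lines of Python, with `statistics.NormalDist` from the standard library playing the role of the normal percentage tables. The function name `normal_mean_ci` and the data are our own illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist, fmean


def normal_mean_ci(xs, sigma, gamma=0.99):
    """Symmetric 100*gamma percent CI for the mean when sigma is known."""
    n = len(xs)
    z = NormalDist().inv_cdf((1 + gamma) / 2)   # upper (1 - gamma)/2 point
    half_width = z * sigma / sqrt(n)
    centre = fmean(xs)
    return centre - half_width, centre + half_width


z = NormalDist().inv_cdf(0.995)
print("z =", z)                                   # close to the table value 2.5758
lo, hi = normal_mean_ci([1.0, 2.0, 3.0, 4.0], sigma=2.0)
print("99 percent CI:", (lo, hi))
```
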
Example 3.8.2 The next example is to determine the CI for the unknown
variance of a normal distribution with a known mean. Assume that $X_i \sim
\mathrm{N}(\mu,\sigma^2)$, where $\mu$ is known and $\sigma^2 > 0$ has to be estimated. We want to find
a 98 percent CI for $\sigma^2$. Then $\sum_i (X_i - \mu)^2/\sigma^2 \sim \chi^2_n$. Denote by $F_{\chi^2_n}$ the CDF
$P(X < x)$ of a RV $X \sim \chi^2_n$. Take $h^- < h^+$ such that $F_{\chi^2_n}(h^+) - F_{\chi^2_n}(h^-) =
0.98$. Then the condition

$$P\left(h^- < \frac{1}{\sigma^2}\sum_i (X_i - \mu)^2 < h^+\right) = 0.98$$

can be rewritten as

$$P\left(\frac{1}{h^+}\sum_i (X_i - \mu)^2 < \sigma^2 < \frac{1}{h^-}\sum_i (X_i - \mu)^2\right) = 0.98.$$

This gives the interval

$$\left(\frac{1}{h^+}\sum_i (X_i - \mu)^2,\ \frac{1}{h^-}\sum_i (X_i - \mu)^2\right).$$

Again we have a choice of $h^-$ and $h^+$; a symmetric, or equal-tailed, option
is to take $F_{\chi^2_n}(h^-) = 1 - F_{\chi^2_n}(h^+) = (1 - 0.98)/2 = 0.01$. From the $\chi^2$
percentage tables, for $n = 38$, $h^- = 20.69$, $h^+ = 61.16$. See, for example,
[94], pp. 40-41 (this is the standard reference to statistical tables used in a
number of examples and problems below).
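
The table values quoted in Example 3.8.2 can be reproduced without tables when the number of degrees of freedom is even. The sketch below is an illustration under that assumption: it uses the standard identity $P(Y < 2s) = 1 - \sum_{l<k} e^{-s}s^l/l!$ for $Y \sim \chi^2_{2k}$, and the function names and the bisection search are our own choices.

```python
import math


def chi2_cdf_even(x, df):
    """P(Y < x) for Y ~ chi-square with an even number df = 2k of degrees of freedom."""
    k, s = df // 2, x / 2.0
    term, partial = math.exp(-s), 0.0
    for l in range(k):                  # partial = sum_{l < k} e^{-s} s^l / l!
        partial += term
        term *= s / (l + 1)
    return 1.0 - partial


def chi2_quantile_even(p, df, upper=1e4):
    """Solve chi2_cdf_even(x, df) = p by bisection (the CDF increases in x)."""
    lo, hi = 0.0, upper
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def variance_ci(xs, mu, gamma=0.98):
    """Equal-tailed CI for sigma^2 with known mean mu; len(xs) must be even here."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    h_minus = chi2_quantile_even((1 - gamma) / 2, n)
    h_plus = chi2_quantile_even((1 + gamma) / 2, n)
    return ss / h_plus, ss / h_minus


print(chi2_quantile_even(0.01, 38), chi2_quantile_even(0.99, 38))
```

For $n = 38$ the two printed quantiles agree with the values $h^- = 20.69$ and $h^+ = 61.16$ to the accuracy of the tables.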


In Examples 3.8.1 and 3.8.2 we managed to find an RV $Z = Z(T(\mathbf{X}),\theta)$
which is a function of a sufficient statistic and the unknown parameter $\theta$ ($\mu$
in Example 3.8.1 and $\sigma^2$ in Example 3.8.2) and also has a distribution not
depending on this parameter. Namely,

$$Z = \sqrt{n}(\overline{X} - \mu)/\sigma \sim \mathrm{N}(0,1)\quad\text{in Example 3.8.1}$$

and

$$Z = \sum_i (X_i - \mu)^2/\sigma^2 \sim \chi^2_n\quad\text{in Example 3.8.2.}$$

We then produced values $y_\pm$ such that $P(y_- < Z < y_+) = \gamma$ ($z_\pm$ in
Example 3.8.1 and $h^\pm$ in Example 3.8.2) and solved the equations $Z(T,\theta) =
y_\pm$ to find roots

$$a(\mathbf{X}) = a(T(\mathbf{X}), y_\pm)\quad\text{and}\quad b(\mathbf{X}) = b(T(\mathbf{X}), y_\pm).$$

The last step is not always straightforward, in which case various approximations
may be useful.


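Example 3.8.3 In this (more challenging) example the above idea is, in
a sense, pushed to its limits. Suppose $X_i \sim \mathrm{Po}(\lambda)$, and we want to find a
$100\gamma$ percent CI for $\lambda$. Here we know that $n\overline{X} \sim \mathrm{Po}(n\lambda)$, which still depends
on $\lambda$. The CDF $F = F_\lambda$ for $\overline{X}$ jumps at points $k/n$, $k = 0,1,\dots$, and has
the form

$$F(x;\lambda) = I(x > 0)\sum_{r=0}^{\lfloor nx\rfloor_-} e^{-n\lambda}\,\frac{(n\lambda)^r}{r!}, \qquad (3.8.2)$$

where, for $y > 0$,

$$\lfloor y\rfloor_- = \begin{cases} y - 1, & \text{if } y \text{ is an integer},\\ [y], \text{ the integer part of } y, & \text{if } y \text{ is not an integer}.\end{cases}$$

It is differentiable and monotone decreasing in $\lambda$:

$$\frac{\partial}{\partial\lambda}F(k/n;\lambda) = -n\,e^{-n\lambda}\,\frac{(n\lambda)^{k-1}}{(k-1)!} < 0.$$

Thus if we look for bounds $a < \lambda < b$, it is equivalent to the functional
inequalities $F(x;b) < F(x;\lambda) < F(x;a)$, for all $x > 0$ of the form $k/n$,
$k = 1,2,\dots$. We want $a$ and $b$ to be functions of $\mathbf{x}$, more precisely, of the
sample mean $\overline{x}$. The symmetric, or equal-tailed, $100\gamma$ percent CI is with
endpoints $a(\overline{X})$ and $b(\overline{X})$ such that

$$P\left(\lambda < a(\overline{X})\right) = P\left(\lambda > b(\overline{X})\right) = \frac{1-\gamma}{2}.$$

Write

$$P\left(\lambda > b(\overline{X})\right) = P\left(F(\overline{X};\lambda) < F(\overline{X};b(\overline{X}))\right)$$

and use the following fact (a modification of formula (2.1.57)):
If, for an RV $X$ with CDF $F$ and a function $g:\mathbb{R}\to\mathbb{R}$, there exists
a unique point $c \in \mathbb{R}$ such that $F(c) = g(c)$, $F(y) < g(y)$ for $y < c$ and
$F(y) \ge g(y)$ for $y \ge c$, then

$$P\left(F(X) < g(X)\right) = g(c). \qquad (3.8.3)$$

Next, by equation (3.8.2):

$$P\left(F(\overline{X};\lambda) < F(\overline{X};b(\overline{X}))\right) = F\left(\frac{k}{n};\,b\left(\frac{k}{n}\right)\right)$$

provided that there exists $k$ such that $F(k/n;\lambda) = F(k/n;b(k/n))$. In other
words, if you choose $b = b(\overline{x})$ so that

$$\sum_{l=0}^{n\overline{x}} e^{-nb}\,\frac{(nb)^l}{l!} = \frac{1-\gamma}{2}, \qquad (3.8.4)$$

it will guarantee that $P(\lambda > b(\overline{X})) \le (1-\gamma)/2$.
Similarly, choosing $a = a(\overline{x})$ so that

$$\sum_{l=n\overline{x}}^{\infty} e^{-na}\,\frac{(na)^l}{l!} = \frac{1-\gamma}{2} \qquad (3.8.5)$$

will guarantee that $P(\lambda < a(\overline{X})) \le (1-\gamma)/2$. Hence, $(a(\overline{X}), b(\overline{X}))$ is the
(equal-tailed) $100\gamma$ percent CI for $\lambda$.

The distributions of $a(\overline{X})$ and $b(\overline{X})$ can be identified in terms of quantiles
of the $\chi^2$ distribution. In fact, for all $k = 1,2,\dots$,

$$\frac{d}{ds}\sum_{k\le l<\infty} e^{-s}\,\frac{s^l}{l!} = \sum_{k\le l<\infty}\left[-e^{-s}\,\frac{s^l}{l!} + e^{-s}\,\frac{s^{l-1}}{(l-1)!}\right]
= e^{-s}\,\frac{s^{k-1}}{(k-1)!} = 2f_{Y_1}(2s),$$

where $Y_1 \sim \chi^2_{2k}$. That is, $\sum_{k\le l<\infty} e^{-s}s^l/l! = P(Y_1 < 2s)$. We see that

$$2na = h^-_{m^-}\left((1-\gamma)/2\right),$$

the lower $(1-\gamma)/2$-quantile of the $\chi^2$-distribution with $m^- = 2n\overline{X}$ degrees
of freedom.
Similarly, for all $k = 0,1,\dots$:

$$\frac{d}{ds}\sum_{0\le l\le k} e^{-s}\,\frac{s^l}{l!} = \sum_{0\le l\le k}\left[-e^{-s}\,\frac{s^l}{l!} + e^{-s}\,\frac{s^{l-1}}{(l-1)!}\,I(l\ge 1)\right]
= -e^{-s}\,\frac{s^{k}}{k!} = -2f_{Y_2}(2s)$$

for $Y_2 \sim \chi^2_{2k+2}$. That is,

$$\sum_{0\le l\le k} e^{-s}\,\frac{s^l}{l!} = P(Y_2 > 2s).$$

It means that

$$2nb = h^+_{m^+}\left((1-\gamma)/2\right),$$

the upper $(1-\gamma)/2$-quantile (equivalently, the $(1+\gamma)/2$-quantile) of the $\chi^2$
distribution with $m^+ = 2n\overline{X} + 2$ degrees of freedom.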
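
Equations (3.8.4) and (3.8.5) can also be solved numerically. The following sketch uses plain bisection on the Poisson tail sums; the function names, the crude upper search limit and the convention of returning $a = 0$ when no events are observed are our own assumptions, not part of the example.

```python
import math


def poisson_cdf(k, mean):
    """P(N <= k) for N ~ Po(mean)."""
    term, acc = math.exp(-mean), 0.0
    for l in range(k + 1):
        acc += term
        term *= mean / (l + 1)
    return acc


def bisect_decreasing(func, lo, hi, iters=200):
    """Root of a decreasing function func on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_poisson_ci(total, n, gamma=0.95):
    """Equal-tailed CI for lambda given total = n * xbar events in n observations."""
    alpha = (1 - gamma) / 2
    top = (total + 10 * math.sqrt(total + 1) + 20) / n    # crude upper search limit
    # (3.8.4): sum_{l <= total} e^{-nb} (nb)^l / l! = alpha; the sum decreases in b.
    b = bisect_decreasing(lambda lam: poisson_cdf(total, n * lam) - alpha, 0.0, top)
    if total == 0:
        return 0.0, b                                     # convention assumed here
    # (3.8.5): sum_{l >= total} e^{-na} (na)^l / l! = alpha; the sum increases in a.
    a = bisect_decreasing(
        lambda lam: alpha - (1 - poisson_cdf(total - 1, n * lam)), 0.0, top)
    return a, b


a, b = exact_poisson_ci(total=10, n=1, gamma=0.95)
print(a, b)
```

With 10 events in a single observation and $\gamma = 0.95$, the endpoints are half the lower 0.025 point of $\chi^2_{20}$ and half the upper 0.025 point of $\chi^2_{22}$, in line with the identification above.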
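These answers look cumbersome. However, for large $n$, we can think that

$$\overline{X} \sim \mathrm{N}(\lambda, \lambda/n),\quad\text{i.e.}\quad \sqrt{n/\lambda}\,(\overline{X} - \lambda) \sim \mathrm{N}(0,1).$$

Then if we take $\gamma = 0.99$ and, as before, $-a_- = a_+ = 2.5758$, we have that

$$P\left(a_- < \sqrt{\frac{n}{\lambda}}\,(\overline{X} - \lambda) < a_+\right) \approx 0.99.$$

We can solve this equation for $\lambda$ (or rather $\sqrt{\lambda}$). In fact, $(\overline{X} - \lambda)/\sqrt{\lambda}$
decreases with $\lambda$, and from the equation

$$\frac{\overline{X} - \lambda}{\sqrt{\lambda}} = \frac{a_\pm}{\sqrt{n}}$$

we find:

$$\sqrt{\lambda_-} = -\frac{a_+}{2\sqrt{n}} + \sqrt{\frac{a_+^2}{4n} + \overline{X}},\qquad
\sqrt{\lambda_+} = -\frac{a_-}{2\sqrt{n}} + \sqrt{\frac{a_-^2}{4n} + \overline{X}}$$

(the negative roots have to be discarded). Hence, we obtain that the probability

$$P\left(\left(\sqrt{\frac{2.5758^2}{4n} + \overline{X}} - \frac{2.5758}{2\sqrt{n}}\right)^2 < \lambda < \left(\sqrt{\frac{2.5758^2}{4n} + \overline{X}} + \frac{2.5758}{2\sqrt{n}}\right)^2\right)$$

is $\approx 0.99$. Hence,

$$\left(\left(\sqrt{\frac{2.5758^2}{4n} + \overline{X}} - \frac{2.5758}{2\sqrt{n}}\right)^2,\ \left(\sqrt{\frac{2.5758^2}{4n} + \overline{X}} + \frac{2.5758}{2\sqrt{n}}\right)^2\right)$$

gives an approximate answer whose accuracy increases with $n$.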
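
The approximate interval is easy to evaluate. In the sketch below the function name and the sample values are illustrative assumptions; the closing lines check that both endpoints solve $(\overline{x} - \lambda)/\sqrt{\lambda} = \pm z/\sqrt{n}$.

```python
from math import sqrt


def approx_poisson_ci(xbar, n, z=2.5758):
    """Normal-approximation CI for a Poisson mean; z = 2.5758 gives 99 percent."""
    root = sqrt(z * z / (4 * n) + xbar)
    shift = z / (2 * sqrt(n))
    return (root - shift) ** 2, (root + shift) ** 2


xbar, n, z = 4.0, 100, 2.5758
lo, hi = approx_poisson_ci(xbar, n, z)
print("approximate 99 percent CI:", (lo, hi))
# Both endpoints solve (xbar - lam) / sqrt(lam) = +z/sqrt(n) or -z/sqrt(n).
print((xbar - lo) / sqrt(lo), z / sqrt(n))
print((xbar - hi) / sqrt(hi), -z / sqrt(n))
```
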

Confidence intervals for the mean of a Poisson distribution attracted particular
attention in many books, beginning with [108]. In that book, the term
‘confidence belt’ is used, to stress that the data serve a range of values of
both $n$ and $\overline{X}$.


3.9 Bayesian estimation 


Bayesian Instinct. 
Trading Priors. 
(From the series ‘Movies that never made it to the Big Screen’.) 


A useful alternative to the above version of the estimation theory is where
$\theta$ is treated as a random variable with values in $\Theta$ and some (given) prior
PDF or PMF $\pi(\theta)$. After observing a sample $\mathbf{x}$, we can produce the posterior
PDF or PMF $\pi(\theta|\mathbf{x})$. Owing to Bayes’ Theorem, $\pi(\theta|\mathbf{x})$ is defined by

$$\pi(\theta|\mathbf{x}) \propto \pi(\theta)\,f(\mathbf{x};\theta). \qquad (3.9.1)$$

More precisely,

$$\pi(\theta|\mathbf{x}) = \frac{1}{f(\mathbf{x})}\,\pi(\theta)\,f(\mathbf{x};\theta), \qquad (3.9.2)$$

where

$$f(\mathbf{x}) = \int_\Theta \pi(\theta)\,f(\mathbf{x};\theta)\,d\theta \quad\text{or}\quad \sum_{\theta\in\Theta} \pi(\theta)\,f(\mathbf{x};\theta).$$

Pictorially speaking, (posterior) $\propto$ (prior) $\times$ (likelihood), where the constant
of proportionality is simply chosen to normalise the total mass to 1.

Remark 3.9.1 Note that the likelihood $f(\mathbf{x};\theta)$ in equations (3.9.1) and
(3.9.2) is considered as a conditional PDF/PMF of $\mathbf{X}$, given $\theta$. This is the
Bayesian interpretation of the likelihood, as opposed to the Fisherian interpretation
where it is considered as a function of $\theta$ for fixed $\mathbf{x}$.


In Examples 3.9.2-3.9.5 we calculate posterior distributions in some mod- 
els. 


Example 3.9.2 Let $X_i \sim \mathrm{Bin}(m,\theta)$ and the prior distribution for $\theta$ is
Bet$(a,b)$ for some known $a, b$:

$$\pi(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}\,I(0 < \theta < 1);$$

see Example 3.1.5. Then the posterior is

$$\pi(\theta|\mathbf{x}) \propto \theta^{\sum_i x_i + a - 1}(1-\theta)^{nm - \sum_i x_i + b - 1}\,I(0 < \theta < 1),$$

which is Bet$(\sum_i x_i + a,\ nm - \sum_i x_i + b)$ = Bet$(n\overline{x} + a,\ n(m - \overline{x}) + b)$, with
the proportionality constant $1/B(n\overline{x} + a,\ n(m - \overline{x}) + b)$. In other words,
a Beta prior generates a Beta posterior. One says that the Beta family is
conjugate for binomial samples.
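
In code, conjugacy means that updating the prior is mere bookkeeping of two parameters. The helper names below are illustrative assumptions; the posterior mean $a/(a+b)$ of a Bet$(a,b)$ distribution is a standard fact.

```python
def beta_binomial_posterior(xs, m, a, b):
    """Parameters of the Beta posterior for Bin(m, theta) data and a Bet(a, b) prior."""
    n, total = len(xs), sum(xs)
    return a + total, b + n * m - total


def beta_mean(a, b):
    """Mean of the Bet(a, b) distribution."""
    return a / (a + b)


post_a, post_b = beta_binomial_posterior([3, 5, 4], m=10, a=2, b=2)
print(post_a, post_b, beta_mean(post_a, post_b))
```
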


The Unbelievable Conjugacy of Beta. 
(From the series ‘Movies that never made it to the Big Screen’.) 


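Example 3.9.3 The Beta family is also conjugate for negative binomial
samples, where $X_i \sim \mathrm{NegBin}(r,\theta)$, with known $r$. In fact, here the posterior

$$\pi(\theta|\mathbf{x}) \propto \theta^{nr + a - 1}(1-\theta)^{n\overline{x} + b - 1},$$

i.e. is Bet$(nr + a,\ n\overline{x} + b)$.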


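Example 3.9.4 Another popular example of a conjugate family is Gamma,
for Poisson or exponential samples. Indeed, if $X_i \sim \mathrm{Po}(\lambda)$ and $\pi(\lambda) \propto
\lambda^{\tau-1}e^{-\lambda/\alpha}$, then the posterior

$$\pi(\lambda|\mathbf{x}) \propto \lambda^{n\overline{x} + \tau - 1}\,e^{-\lambda(\alpha n + 1)/\alpha}$$

has the PDF Gam$(\tau + n\overline{x},\ (n\alpha + 1)/\alpha)$.
Similarly, if $X_i \sim \mathrm{Exp}(\lambda)$ and $\pi(\lambda) \propto \lambda^{\tau-1}e^{-\lambda\alpha}$, then the posterior

$$\pi(\lambda|\mathbf{x}) \propto \lambda^{n + \tau - 1}\,e^{-\lambda(n\overline{x} + \alpha)}$$

again has the PDF Gam$(\tau + n,\ n\overline{x} + \alpha)$.

Example 3.9.5 The normal distributions also emerge as a conjugate family.
Namely, assume that $X_i \sim \mathrm{N}(\mu,\sigma^2)$ where $\sigma^2$ is known, and $\pi(\mu) \propto
\exp[-(\mu - a)^2/(2b^2)]$, for some known $a \in \mathbb{R}$ and $b > 0$. Then, by using the
solutions to Worked Examples 2.3.10 and 2.3.11, posterior $\pi(\mu|\mathbf{x})$ is

$$\pi(\mu|\mathbf{x}) \propto \exp\left[-\frac{(\mu - a)^2}{2b^2} - \sum_i \frac{(x_i - \mu)^2}{2\sigma^2}\right] \propto \exp\left[-\frac{(\mu - a_1)^2}{2b_1^2}\right],$$

where

$$\frac{1}{b_1^2} = \frac{1}{b^2} + \frac{n}{\sigma^2}\quad\text{and}\quad a_1 = b_1^2\left(\frac{a}{b^2} + \frac{n\overline{x}}{\sigma^2}\right).$$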

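A further step is to introduce a loss function (LF) measuring the loss
incurred in estimating a given parameter $\theta$. This is a function $L(\theta,a)$, $\theta, a \in
\Theta$, where $\theta$ is the true and $a$ is the guessed value. For instance, a quadratic
error LF is $L(\theta,a) = (\theta - a)^2$, an absolute error LF is $L(\theta,a) = |\theta - a|$ etc.
We then consider the posterior expected loss

$$R(\mathbf{x},a) = \int_\Theta \pi(\theta|\mathbf{x})\,L(\theta,a)\,d\theta \quad\text{or}\quad \sum_{\theta\in\Theta} \pi(\theta|\mathbf{x})\,L(\theta,a), \qquad (3.9.3)$$

while guessing value $a$. We want to choose $\hat a = \hat a(\mathbf{x})$ minimising $R(\mathbf{x},a)$:

$$\hat a = \arg\min_a R(\mathbf{x},a). \qquad (3.9.4)$$

The minimiser, $\hat a$, is called an optimal Bayes estimator, or optimal estimator
for short. (Some authors say ‘optimal point estimator’.) For the
quadratic loss,

$$R(\mathbf{x},a) = \int_\Theta \pi(\theta|\mathbf{x})\,(\theta - a)^2\,d\theta.$$

By differentiation, $R(\mathbf{x},a)$ is minimised at

$$\hat a = \int_\Theta \theta\,\pi(\theta|\mathbf{x})\,d\theta,$$

i.e. at the posterior mean $E(\theta|\mathbf{x})$. Furthermore, the minimal value of the
posterior expected loss is

$$\min_a R(\mathbf{x},a) = R(\mathbf{x},\hat a) = \int_\Theta \left[\theta - E(\theta|\mathbf{x})\right]^2\pi(\theta|\mathbf{x})\,d\theta,$$

which equals $E[(\theta - E(\theta|\mathbf{x}))^2\,|\,\mathbf{x}]$, the posterior variance. For the absolute
error loss,

$$R(\mathbf{x},a) = \int \pi(\theta|\mathbf{x})\,|\theta - a|\,d\theta = \int_{-\infty}^{a} \pi(\theta|\mathbf{x})\,(a - \theta)\,d\theta + \int_{a}^{\infty} \pi(\theta|\mathbf{x})\,(\theta - a)\,d\theta,$$

which is minimised at the posterior median, i.e. the value $a$ for which

$$\int_{-\infty}^{a} \pi(\theta|\mathbf{x})\,d\theta = \int_{a}^{\infty} \pi(\theta|\mathbf{x})\,d\theta = 1/2.$$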
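
The two minimisation statements can be checked numerically on a toy discrete posterior. Everything in the sketch below (the four support points, their posterior probabilities and the search grid) is an illustrative assumption of ours.

```python
def expected_loss(thetas, probs, a, loss):
    """Posterior expected loss R(x, a) for a discrete posterior."""
    return sum(p * loss(t, a) for t, p in zip(thetas, probs))


def quadratic(theta, a):
    return (theta - a) ** 2


def absolute(theta, a):
    return abs(theta - a)


thetas = [0.0, 1.0, 2.0, 5.0]           # support of the toy posterior
probs = [0.1, 0.3, 0.4, 0.2]            # posterior probabilities; the median is 2
grid = [i / 100 for i in range(501)]    # candidate guesses a in [0, 5]

best_quadratic = min(grid, key=lambda a: expected_loss(thetas, probs, a, quadratic))
best_absolute = min(grid, key=lambda a: expected_loss(thetas, probs, a, absolute))
posterior_mean = sum(t * p for t, p in zip(thetas, probs))

print("quadratic loss minimiser:", best_quadratic, " posterior mean:", posterior_mean)
print("absolute loss minimiser: ", best_absolute)
```

The quadratic loss is minimised at the posterior mean, while the absolute error loss is minimised at the posterior median, which here differs from the mean because the toy posterior is skewed.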


A straightforward but important remark is that, in general, $\hat a(\mathbf{x})$ also
minimises the unconditional expected loss among all estimators $d: \mathbf{x} \mapsto \Theta$.
Here, it is instructive to slightly change the terminology: an estimator is
considered as a decision rule (you observe $\mathbf{x}$ and decide that the value of the
parameter is $d(\mathbf{x})$). Given a value $\theta$ and a decision rule $d$, the quantity

$$r(\theta,d) = E_\theta L(\theta,d(\mathbf{X})) = \sum_{\mathbf{x}} L(\theta,d(\mathbf{x}))\,f(\mathbf{x};\theta) \quad\text{or}\quad \int L(\theta,d(\mathbf{x}))\,f(\mathbf{x};\theta)\,d\mathbf{x}$$

represents the risk under decision rule $d$ when the parameter value is $\theta$. We
want to minimise the Bayes risk

$$r^{\mathrm{B}}(d) = \int_\Theta r(\theta,d)\,\pi(\theta)\,d\theta \quad\text{or}\quad \sum_{\theta\in\Theta} r(\theta,d)\,\pi(\theta). \qquad (3.9.5)$$

The remark is that

$$\hat a = \arg\min_d r^{\mathrm{B}}(d). \qquad (3.9.6)$$

For that reason, the optimal Bayes estimator is also called the optimal rule.

Formally, in equation (3.9.4) you minimise for every given $\mathbf{x}$ while in
(3.9.6) you minimise the sum or the integral over all values of $\mathbf{x}$. It is important
to see that both procedures lead to the same answer. The above remark
asserts that the minimiser in equation (3.9.4) yields the minimum of $r^{\mathrm{B}}(d)$. But
what about the inverse statement that every decision rule minimising $r^{\mathrm{B}}(d)$
coincides with the optimal Bayes estimator? This is also true: by changing
the order of summation/integration over $\mathbf{x}$ and $\theta$, write

$$r^{\mathrm{B}}(d) = \int E_{\pi(\theta|\mathbf{x})}L(\theta,d(\mathbf{x}))\,f(\mathbf{x})\,d\mathbf{x} \quad\text{or}\quad \sum_{\mathbf{x}} E_{\pi(\theta|\mathbf{x})}L(\theta,d(\mathbf{x}))\,f(\mathbf{x}). \qquad (3.9.7)$$

Here $f(\mathbf{x})$ is the marginal PDF/PMF of $\mathbf{X}$:

$$f(\mathbf{x}) = \sum_{\theta\in\Theta} \pi(\theta)\,f(\mathbf{x};\theta) \quad\text{or}\quad \int_\Theta \pi(\theta)\,f(\mathbf{x};\theta)\,d\theta,$$

and $\pi(\theta|\mathbf{x})$ is the posterior PDF/PMF of $\theta$ for given $\mathbf{x}$.
Because $f(\mathbf{x}) \ge 0$, the minimum in $d$ of the sum/integral on the RHS
of equation (3.9.7) can only be achieved when summands or values of the
integrand $E_{\pi(\theta|\mathbf{x})}L(\theta,d(\mathbf{x}))$ attain their minima in $d$. But this exactly means
that the minimising decision rule equals $\hat a$.
In Examples 3.9.6-3.9.8 we calculate Bayes estimators in some models.


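Example 3.9.6 A standard example is where $X_i \sim \mathrm{N}(\mu,\sigma^2)$, $\sigma^2$ known
and the prior for $\mu$ is N$(0,\tau^2)$, with known $\tau^2 > 0$. For the posterior, we have

$$\pi(\mu|\mathbf{x}) \propto \pi(\mu)\,f(\mathbf{x};\mu) \propto \exp\left[-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2 - \frac{\mu^2}{2\tau^2}\right]
\propto \exp\left[-\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)\left(\mu - \frac{\sum_i x_i/\sigma^2}{n/\sigma^2 + \tau^{-2}}\right)^2\right].$$

That is

$$\pi(\mu|\mathbf{x}) \sim \mathrm{N}\left(\frac{n\overline{x}\tau^2}{n\tau^2 + \sigma^2},\ \frac{\sigma^2\tau^2}{n\tau^2 + \sigma^2}\right).$$

The mean of a normal distribution equals its median. Thus under both
quadratic and absolute error LFs, the optimal Bayes estimator for $\mu$ is

$$\frac{n\overline{x}\tau^2}{n\tau^2 + \sigma^2}.$$

If the prior is N$(\nu,\tau^2)$ with a general mean $\nu$, then the Bayes estimator
for $\mu$ under both the quadratic and absolute error loss is

$$\frac{\nu\sigma^2 + n\overline{x}\tau^2}{n\tau^2 + \sigma^2}.$$
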
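Example 3.9.7 Next, let $X_i \sim \mathrm{Po}(\lambda)$, where the prior for $\lambda$ is exponential,
of known rate $\tau > 0$. The posterior

$$\pi(\lambda|\mathbf{x}) \propto e^{-n\lambda}\lambda^{n\overline{x}}\,e^{-\tau\lambda}$$

is Gam$(n\overline{x} + 1,\ n + \tau)$. Under the quadratic loss, the optimal Bayes estimator
is

$$\frac{n\overline{x} + 1}{n + \tau}.$$

On the other hand, under the absolute error loss the optimal Bayes estimator
equals the value $\hat\lambda > 0$ for which

$$\frac{(n+\tau)^{n\overline{x}+1}}{(n\overline{x})!}\int_0^{\hat\lambda} \lambda^{n\overline{x}}\,e^{-(n+\tau)\lambda}\,d\lambda = \frac{1}{2}.$$
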


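Example 3.9.8 It is often assumed that the prior is uniform on a given
interval. For example, suppose $X_1,\dots,X_n$ are IID RVs from a distribution
uniform on $(\theta - 1, \theta + 1)$, and that the prior for $\theta$ is uniform on $(a,b)$. Then
the posterior is

$$\pi(\theta|\mathbf{x}) \propto I(a < \theta < b)\,I(\max_i x_i - 1 < \theta < \min_i x_i + 1)
= I\left(a \vee (\max_i x_i - 1) < \theta < b \wedge (\min_i x_i + 1)\right),$$

where $\alpha \vee \beta = \max[\alpha,\beta]$, $\alpha \wedge \beta = \min[\alpha,\beta]$. So, the posterior is uniform
over this interval. Then the quadratic and absolute error LFs give the same
Bayes estimator:

$$\hat\theta = \frac{1}{2}\left[a \vee (\max_i x_i - 1) + b \wedge (\min_i x_i + 1)\right].$$

Another example is where $\theta$ is uniform on $(0,1)$ and $X_1,\dots,X_n$ take two
values, 1 and 0, with probabilities $\theta$ and $1 - \theta$. Here, the posterior is

$$\pi(\theta|\mathbf{x}) \propto \theta^{n\overline{x}}(1-\theta)^{n - n\overline{x}},$$

i.e. the posterior is Bet$(n\overline{x} + 1,\ n - n\overline{x} + 1)$:

$$\pi(\theta|\mathbf{x}) \propto \theta^{t}(1-\theta)^{n-t},\quad 0 < \theta < 1.$$

So, for the quadratic loss $L(\theta,d) = (\theta - d)^2$, the optimal Bayes estimator
is given by

$$\hat d = \frac{\int_0^1 d\theta\,\theta^{t+1}(1-\theta)^{n-t}}{\int_0^1 d\theta\,\theta^{t}(1-\theta)^{n-t}}
= \frac{(t+1)!\,(n-t)!\,(n+1)!}{(n+2)!\,t!\,(n-t)!} = \frac{t+1}{n+2}.$$

Here $t := \sum_i x_i$, and we used the identity

$$\int_0^1 x^{m-1}(1-x)^{n-1}\,dx = (m-1)!\,(n-1)!/(m+n-1)!$$

which is valid for all positive integers $m$ and $n$.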


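Next, we remark on another type of LF, a 0,1-loss where $L(\theta,a) = 1 - \delta_{\theta,a}$
equals 0 when $\theta = a$ and 1 otherwise. Such a function is natural when the
set $\Theta$ of possible values of $\theta$ is finite. For example, assume that $\Theta$ consists
of two values, say 0 and 1. Let the prior probabilities be $p_0$ and $p_1$ and the
corresponding PMF/PDF $f_0$ and $f_1$. The posterior probabilities $\pi(\,\cdot\,|\mathbf{x})$ are

$$\pi(0|\mathbf{x}) = \frac{p_0 f_0(\mathbf{x})}{p_0 f_0(\mathbf{x}) + p_1 f_1(\mathbf{x})},\qquad \pi(1|\mathbf{x}) = \frac{p_1 f_1(\mathbf{x})}{p_0 f_0(\mathbf{x}) + p_1 f_1(\mathbf{x})},$$

and the Bayes estimator $\hat a$, with values 0 and 1, should minimise the expected
posterior loss $R(\mathbf{x},a) = \pi(1 - a|\mathbf{x})$, $a = 0,1$. That is

$$\hat a = \hat a(\mathbf{x}) = \begin{cases} 0, & \text{if } \pi(0|\mathbf{x}) > \pi(1|\mathbf{x}),\\ 1, & \text{if } \pi(1|\mathbf{x}) > \pi(0|\mathbf{x});\end{cases}$$

in the case $\pi(0|\mathbf{x}) = \pi(1|\mathbf{x})$ any choice will give the same expected posterior
loss 1/2. In other words,

$$\hat a = \begin{cases} 1\\ 0 \end{cases}\ \text{when}\ \ \frac{\pi(1|\mathbf{x})}{\pi(0|\mathbf{x})} = \frac{p_1 f_1(\mathbf{x})}{p_0 f_0(\mathbf{x})}\ \begin{cases} > 1,\\ < 1, \end{cases}\ \text{i.e.}\ \ \frac{f_1(\mathbf{x})}{f_0(\mathbf{x})}\ \begin{cases} > p_0/p_1,\\ < p_0/p_1. \end{cases} \qquad (3.9.8)$$

We see that the ‘likelihood ratio’ $f_1(\mathbf{x})/f_0(\mathbf{x})$ is conspicuous here; we will
encounter it many times in the next chapter.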
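
Rule (3.9.8) takes only a line of code. In the sketch below the function name and the two candidate densities, N(0,1) for $\theta = 0$ and N(1,1) for $\theta = 1$, are illustrative assumptions; ties are resolved in favour of 0, which is allowed since either choice gives the same expected posterior loss.

```python
from statistics import NormalDist


def bayes_decision(x, p0, p1, f0, f1):
    """Bayes rule under the 0,1-loss for Theta = {0, 1}: pick the larger posterior."""
    return 1 if p1 * f1(x) > p0 * f0(x) else 0


f0 = NormalDist(0.0, 1.0).pdf    # density of the observation when theta = 0
f1 = NormalDist(1.0, 1.0).pdf    # density of the observation when theta = 1

# Equal priors: decide 1 when f1(x) / f0(x) = exp(x - 1/2) > 1, i.e. when x > 1/2.
print(bayes_decision(0.4, 0.5, 0.5, f0, f1), bayes_decision(0.6, 0.5, 0.5, f0, f1))
# Prior odds p0/p1 = 4 raise the threshold to x > 1/2 + ln 4, about 1.886.
print(bayes_decision(1.8, 0.8, 0.2, f0, f1), bayes_decision(2.0, 0.8, 0.2, f0, f1))
```
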


To conclude the theme of Bayesian estimation, consider the following
model. Assume RVs $X_1,\dots,X_n$ have $X_i \sim \mathrm{Po}(\theta_i)$:

$$f(x;\theta_i) = \frac{\theta_i^x}{x!}\,e^{-\theta_i},\quad x = 0,1,\dots.$$

Here, parameter $\theta_i$ is itself random, and has a CDF $B$ (on $\mathbb{R}_+ = (0,\infty)$).
We want to estimate $\theta_i$. A classical application is where $X_i$ is the number
of accidents involving driver $i$ in a given year.

First, if we know $B$, then the estimator $T_i(\mathbf{X})$ of $\theta_i$ minimising the mean
square error $E(T_i - \theta_i)^2$ does not depend on $X_j$ with $j \ne i$:

$$T_i(\mathbf{X}) = T(X_i).$$

Here,

$$T(x) = \int \theta\,f(x;\theta)\,dB(\theta)\Big/g(x)$$

and

$$g(x) = \int f(x;\theta)\,dB(\theta).$$

Substituting the Poisson PMF for $f(x;\theta)$ yields

$$T(x) = (x+1)\,\frac{g(x+1)}{g(x)},\quad x = 0,1,\dots.$$

Hence,

$$T_i(\mathbf{X}) = (X_i + 1)\,\frac{g(X_i + 1)}{g(X_i)}. \qquad (3.9.9)$$

But what if we do not know $B$ and have to estimate it from sample $(X_i)$?
A natural guess is to replace $B$ with its sample histogram

$$\hat B(\theta) = \frac{1}{n}\,\#\{i: X_i < \theta\},\quad \theta \ge 0,$$

jumping at integer point $x$ by amount $\hat b(x) = n_x/n$, where

$$n_x = \#\{i: X_i = x\},\quad x = 0,1,\dots. \qquad (3.9.10)$$

Then substitute $\hat b(x)$ instead of $g(x)$. This yields the following estimator of
$\theta_i$:

$$\hat T_i(\mathbf{X}) = (X_i + 1)\,\frac{n_{X_i+1}}{n_{X_i}}. \qquad (3.9.11)$$

This idea (according to some sources, it goes back to British mathematician
A. Turing (1912-1954)) works surprisingly well. See [44], [87]. The surprise
is that estimator $\hat T_i$ uses observations $X_j$, $j \ne i$, that have nothing
to do with $X_i$ (apart from the fact that they have Poisson PMF and the
same prior CDF). This observation was a starting point for the so-called
empirical Bayes methodology developed by H. Robbins (1915-2001), another
outstanding American statistician who began his career in pure mathematics.
Like J.W. Tukey, Robbins did his Ph.D. in topology. In 1941 he and R.
Courant published their book [31], which has remained a must for anyone
interested in mathematics until the present day. Robbins is also credited
with a number of aphorisms and jokes (one of which is ‘Not a single good
deed shall go unpunished’). It was Robbins who first proved rigorously the
bounds $1/(12n+1) < \theta(n) < 1/(12n)$ for the remainder term in Stirling’s formula;
see (1.6.22). He did it in his article ‘A remark on Stirling’s formula’,
Amer. Math. Monthly, 62 (1955), 26-29.

Robbins used formula (3.9.11) to produce a reliable estimator of the number
$S_0$ of accidents incurred in the next year by the $n_0$ drivers who did
not have accidents in the observed year. It is clear that 0 is an underestimate
for $S_0$ (as it assumes that the future is the same as the past). On the other
hand, no = Diez, in; gives an overestimate, since Xo did not contribute 
to the sum >, in;. A good estimator of So is X1, the number of drivers who 
recorded a single accident. In general, $(i+1)n_{i+1}$ accurately estimates $S_i$,
the number of accidents incurred in the future year by the $n_i$ drivers who
recorded $i$ accidents in the observed year.

Robbins introduced the term ‘the number of unseen, or missing, species’.
For example, one can count the number $n_x$ of words used exactly $x$ times
in Shakespeare’s known literary canon, $x=1,2,\ldots$; this gives the following
table (see [45]), where the entry in the row labelled $r$ and the column
labelled $c$ is $n_{r+c}$:

 x      1      2      3      4      5      6      7      8      9     10
 0  14376   4343   2292   1463   1043    837    638    519    430    364
10    305    259    242    223    187    181    179    130    127    128
20    104    105     99    112     93     74     83     76     72     63

So, 14376 words appeared just once, 4343 twice, etc. In addition,
2387 words appeared more than 30 times each. The total number of distinct
words used in the canon equals 31534 (in this counting, the words ‘tree’
and ‘trees’ count separately). The missing species here are words that
Shakespeare knew but did not use; let their number be $n_0$. Then the total number
of words known to Shakespeare is, obviously, $n=31534+n_0$. The length
of the canon (the number of words counted with their multiplicities) equals
$N=884647$.

Assume as before that $X_i$, the number of times word $i$ appears in the
canon, is $\sim\mathrm{Po}(\theta_i)$, where $\theta_i$ is random. Again, if the CDF $B$ of $\theta$ is known, the
posterior expected value $E(\theta_i|X_i=x)$ of $\theta_i$ gives the estimator minimising
the mean square error and equals
$$(x+1)\frac{g(x+1)}{g(x)},\tag{3.9.12}$$
where
$$g(x)=\int\frac{\theta^x}{x!}e^{-\theta}\,dB(\theta),\quad x=0,1,\ldots.$$
If $B$ is unknown, we substitute the (still unknown) histogram
$$\hat B(\theta)=\frac1n\#\{i:\ \theta_i<\theta\},\quad\theta\ge0.$$
The estimator of $\theta_i$ then becomes
$$\hat E(\theta_i|X_i=x)=(x+1)\frac{n_{x+1}}{n_x}.\tag{3.9.13}$$
For $x=1$, it gives the following value for the expectation:
$$\hat E(\theta_i|X_i=1)=2\times\frac{4343}{14376}=0.604.$$
We immediately conclude that the single-time words are overrepresented,
in the sense that if somewhere there exists a new Shakespeare canon equal
in volume to the present one then the 14376 words appearing once in the
present canon will appear in the new one only $0.604\times14376=8683$ times.




Next, set
$$r_0=\sum_{i=1}^n\theta_i\mathbf 1(X_i=0)\Big/\sum_{i=1}^n\theta_i.$$
The numerator is estimated by
$$\hat E(\theta_i|X_i=0)\,n_0=\frac{n_1}{n_0}\,n_0=n_1=14376$$
and the denominator by $N=884647$. Then
$$\hat r_0=\frac{14376}{884647}=0.016.$$
So, with a stretch of imagination one deduces that, should a new Shakespearean
text appear, the probability that its first word will not be from the
existing canon is 0.016; the same conclusion holds for the second word, etc.
In fact, in 1985 the Bodleian Library in Oxford announced the discovery of
a previously unknown poem that some experts attributed to Shakespeare.
The above analysis was applied in this case [46] and gave an interesting
insight.
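
The two numerical conclusions of the Shakespeare example can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `robbins_estimate` is hypothetical, and the counts are the values $n_1=14376$, $n_2=4343$ and $N=884647$ quoted above.

```python
# Word-frequency counts quoted in the text (only n_1 and n_2 are needed here).
n1, n2 = 14376, 4343
N = 884647  # length of the canon, words counted with multiplicity


def robbins_estimate(x, n_x, n_x_plus_1):
    """Empirical Bayes estimate (x + 1) * n_{x+1} / n_x of E(theta_i | X_i = x)."""
    return (x + 1) * n_x_plus_1 / n_x


theta_hat_1 = robbins_estimate(1, n1, n2)   # formula (3.9.13) with x = 1
expected_repeats = theta_hat_1 * n1         # algebraically equal to 2 * n_2
r0_hat = n1 / N                             # estimated weight of unseen words

print(round(theta_hat_1, 3), round(expected_repeats), round(r0_hat, 3))
```

Without intermediate rounding the expected number of repeat appearances is $2n_2=8686$; the figure 8683 in the text arises from multiplying the rounded value 0.604 by 14376.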


4
Hypothesis testing 


4.1 Type I and type II error probabilities. 
Most powerful tests 


Statisticians do it with only a 5 percent chance of being rejected. 
(From the series ‘How they do it’). 


Testing a statistical hypothesis, or hypotheses testing, is another way to make
a judgement about the distribution (PDF or PMF) of an ‘observed’ random
variable $X$ or a sample
$$\mathbf X=\begin{pmatrix}X_1\\ \vdots\\ X_n\end{pmatrix}.$$
Traditionally, one speaks here about null and alternative hypotheses. The
simplest case is where we have to choose between two possibilities: the
PDF/PMF $f$ of $X$ is $f_0$ or $f_1$. We say that $f=f_0$ will represent a simple
null hypothesis $H_0$ and $f=f_1$ a simple alternative $H_1$. This introduces a certain
imparity between $f_0$ and $f_1$, which will also be manifested in further actions.
Suppose the observed value of $X$ is $x$. A ‘scientific’ way to proceed is
to partition the set of values of $X$ (let it be $\mathbb R$) into two complementary
domains, $C$ and $\overline C=\mathbb R\setminus C$, and reject $H_0$ (i.e. accept $H_1$) when $x\in C$ while
accepting $H_0$ when $x\in\overline C$. The test then is identified with domain $C$ (called
a critical region). In other words, we want to employ a two-valued decision
function $d$ (the indicator of set $C$) such that when $d(x)=1$ we reject and
when $d(x)=0$ we accept the null hypothesis.
Suppose we decide to worry mainly about rejecting $H_0$ (i.e. accepting $H_1$)
when $H_0$ is correct: in our view this represents the principal danger. In this
case we say that a type I error has been committed, and we want errors of
this type to be comfortably rare. A less dangerous error would be of type
II: to accept $H_0$ when it is false (i.e. $H_1$ takes place): we want it to occur
reasonably infrequently, after the main goal that type I error is certainly rare
has been guaranteed. Formally, we fix a threshold for the type I error probability
(TIEP), $\alpha\in(0,1)$, and then try to minimise the type II error probability
(TIIEP).

In this situation, one often says that $H_0$ is a conservative hypothesis, not
to be rejected unless there is clear evidence against it.

For example, if you are a physician and see your patient’s histology data,
you say the hypothesis is that the patient has a tumour while the alternative
is that he/she hasn’t. Under $H_1$ (no tumour), the data should group
around ‘regular’ average values whereas under $H_0$ (tumour) they drift towards
‘abnormal’ ones. Given that the test is not 100 percent accurate,
the data should be considered as random. You want to develop a scientific
method of diagnosing the disease. And your main concern is not to miss a
patient with a tumour since this might have serious consequences. On the
other hand, if you commit a type II error (alarming the patient falsely), it
might result in some relatively mild inconvenience (repeated tests, possibly
a brief hospitalisation).

The question then is how to choose the critical region $C$. Given $C$, the
TIEP equals
$$P_0(C)=\int f_0(x)I(x\in C)\,dx\ \text{ or }\ \sum_xf_0(x)I(x\in C),\tag{4.1.1}$$
and is called the size of the critical region $C$ (also the size of the test).
Demanding that probability $P_0(C)$ does not exceed a given $\alpha$ (called the
significance level) does not determine $C$ uniquely: we can have plenty of
such regions. Intuitively, $C$ must contain outcomes $x$ with small values of
$f_0(x)$, regardless of where they are located. But we want to be more precise.
For instance, if PDF $f_0$ in $H_0$ is $N(\mu_0,\sigma_0^2)$, could $C$ contain points near
the mean-value $\mu_0$ where PDF $f_0$ is relatively high? Or should it be a
half-infinite interval $(\mu_0+c,\infty)$ or $(-\infty,\mu_0-c)$ or perhaps the union of the two?
For example, we can choose $c$ such that the integral
$$\frac1{\sqrt{2\pi}\sigma_0}\int_{\mu_0+c}^{\infty}e^{-(x-\mu_0)^2/2\sigma_0^2}\,dx\ \text{ or }\ \frac1{\sqrt{2\pi}\sigma_0}\int_{-\infty}^{\mu_0-c}e^{-(x-\mu_0)^2/2\sigma_0^2}\,dx$$
or their sum is $\le\alpha$. In fact, it is the alternative PDF/PMF, $f_1$, that narrows
the choice of $C$ (and makes it essentially unique), because we wish the
TIIEP
$$P_1(\overline C)=\int f_1(x)I(x\in\overline C)\,dx\ \text{ or }\ \sum_xf_1(x)I(x\in\overline C)\tag{4.1.2}$$
to be minimised, for a given significance level. The complementary probability
$$P_1(C)=\int f_1(x)I(x\in C)\,dx\ \text{ or }\ \sum_xf_1(x)I(x\in C)\tag{4.1.3}$$
is called the power of the test; it has to be maximised.

A test of the maximal power among tests of size $\le\alpha$ is called a most
powerful (MP) test for a given significance level $\alpha$. Colloquially, one calls it an
MP test of size $\alpha$ (because its size actually equals $\alpha$ as will be shown below).
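
To make the notions of size, TIIEP and power concrete, here is a small numerical sketch. The two distributions, $N(0,1)$ under $H_0$ and $N(1,1)$ under $H_1$, and the level 0.05 are assumptions chosen purely for illustration; they do not come from the text. The sketch compares a one-sided critical region $(c,\infty)$ with a two-sided region of the same size.

```python
from statistics import NormalDist

alpha = 0.05
null = NormalDist(mu=0.0, sigma=1.0)         # f_0 (assumed for illustration)
alternative = NormalDist(mu=1.0, sigma=1.0)  # f_1 (assumed for illustration)

# One-sided critical region C = (c, infinity) with P_0(C) = alpha.
c = null.inv_cdf(1 - alpha)

size = 1 - null.cdf(c)           # TIEP, as in (4.1.1)
tiiep = alternative.cdf(c)       # P_1 of the complement of C, as in (4.1.2)
power = 1 - alternative.cdf(c)   # as in (4.1.3)

# A two-sided region {|x| > c2} of the same size, for comparison.
c2 = null.inv_cdf(1 - alpha / 2)
power_two_sided = (1 - alternative.cdf(c2)) + alternative.cdf(-c2)

print(f"c = {c:.3f}, size = {size:.3f}, TIIEP = {tiiep:.3f}, power = {power:.3f}")
print(f"two-sided power = {power_two_sided:.3f}")
```

Both regions have the same size, but the one-sided region pointing towards the alternative mean has the larger power, which illustrates how $f_1$ narrows the choice of $C$.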


4.2 Likelihood ratio tests. The Neyman–Pearson
Lemma and beyond


Statisticians are in need of good inference 
because at young age many of their hypotheses were rejected. 
(From the series ‘Why they are misunderstood’. ) 


A natural (and elegant) idea is to look at the likelihood ratio (LR):
$$\Lambda_{H_1,H_0}(x)=\frac{f_1(x)}{f_0(x)}.\tag{4.2.1}$$
If, for a given $x$, $\Lambda_{H_1,H_0}(x)$ is large, we are inclined to think that $H_0$ is
unlikely, i.e. to reject $H_0$; otherwise we do not reject $H_0$. Then comes the
idea that we should look at regions of the form
$$\{x:\ \Lambda_{H_1,H_0}(x)>k\}\tag{4.2.2}$$
and choose $k$ to adapt to size $\alpha$. The single value $x\in\mathbb R$ can be replaced
here by a vector $\mathbf x=(x_1,\ldots,x_n)\in\mathbb R^n$, with $f_0(\mathbf x)=\prod_if_0(x_i)$ and
$f_1(\mathbf x)=\prod_if_1(x_i)$. Critical region $C$ then becomes a subset of $\mathbb R^n$.

This idea basically works well, as is shown in the famous Neyman–Pearson
(NP) Lemma below. This statement is called after J. Neyman and E.S.
Pearson (1895-1980), a prominent UK statistician. Pearson was the son of
K. Pearson (1857-1936), who is considered the creator of modern statistical
thinking. Both father and son greatly influenced the statistical literature of
the period; their names will repeatedly appear in this part of the book.

It is interesting to compare the lives of the authors of the NP Lemma.
Pearson spent all his active life at the University College London, where he 
followed his father. On the other hand, Neyman lived through a period of 
civil unrest, revolution and war in parts of the Russian Empire. See [115]. In 
1920, as a Pole, he was jailed in the Ukraine and expected a harsh sentence 
(there was war with Poland and he was suspected of being a spy). He was 
saved after long negotiations between his wife and a Bolshevik official who 
in the past had been a medical student under the supervision of Neyman’s 
father-in-law; eventually the Neymans were allowed to escape to Poland. In 
1925 Neyman came to London and began working with E.S. Pearson. One 
of their joint results was the NP Lemma. 

About his collaboration with Pearson, Neyman wrote: ‘The initiative for 
cooperative studies was Pearson’s. Also, at least during the early stages, he 
was the leader. Our cooperative work was conducted through voluminous 
correspondence and at sporadic get-togethers, some in England, others in 
France and some others in Poland. This cooperation continued over the 
decade 1928-38’. 

The setting for the NP Lemma is as follows. Assume $H_0$: $f=f_0$ is to be
tested against $H_1$: $f=f_1$, where $f_1$ and $f_0$ are two distinct PDFs/PMFs.
The NP Lemma states:

For all $k>0$, the test with the critical region $C=\{x:\ f_1(x)>kf_0(x)\}$
has the highest power $P_1(C)$ among all tests (i.e. critical regions $C^*$) of size
$P_0(C^*)\le\alpha=P_0(C)$.

In other words, the test with $C=\{f_1(x)>kf_0(x)\}$ is MP among all
tests of significance level $\alpha=P_0(C)$. Here $\alpha$ appears as a function of $k$:
$\alpha=\alpha(k)$, $k>0$.


Proof Writing $I_C$ and $I_{C^*}$ for the indicator functions of $C$ and $C^*$, we have
$$0\le[I_C(x)-I_{C^*}(x)]\,[f_1(x)-kf_0(x)],$$
as the two brackets never have opposite signs (for $x\in C$: $f_1(x)>kf_0(x)$ and
$I_C(x)\ge I_{C^*}(x)$ while for $x\in\overline C$: $f_1(x)\le kf_0(x)$ and $I_C(x)\le I_{C^*}(x)$). Then
$$0\le\int[I_C(x)-I_{C^*}(x)]\,[f_1(x)-kf_0(x)]\,dx$$
or
$$0\le\sum_x[I_C(x)-I_{C^*}(x)]\,[f_1(x)-kf_0(x)].$$
But the RHS here equals
$$P_1(C)-P_1(C^*)-k[P_0(C)-P_0(C^*)].$$
So
$$P_1(C)-P_1(C^*)\ge k[P_0(C)-P_0(C^*)].$$
Thus, if $P_0(C^*)\le P_0(C)$, then $P_1(C^*)\le P_1(C)$.
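
The statement of the NP Lemma can be checked by brute force on a small discrete example. The sketch below uses two hypothetical PMFs on the sample space $\{0,1,2,3\}$ (they are assumptions for illustration, not taken from the text), forms the region $C=\{x:\ f_1(x)>kf_0(x)\}$ with $k=1$, and compares its power with that of every other critical region of size at most $P_0(C)$.

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical PMFs on the sample space {0, 1, 2, 3} (illustration only).
f0 = [F(4, 10), F(3, 10), F(2, 10), F(1, 10)]
f1 = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]
points = range(len(f0))


def prob(f, region):
    """Probability of a critical region under the PMF f."""
    return sum(f[x] for x in region)


k = F(1)  # threshold in C = {x : f1(x) > k f0(x)}
C = [x for x in points if f1[x] > k * f0[x]]
alpha = prob(f0, C)

# Enumerate every critical region C* with P_0(C*) <= alpha
# and record the largest power P_1(C*) among them.
best_power = max(
    prob(f1, C_star)
    for r in range(len(f0) + 1)
    for C_star in combinations(points, r)
    if prob(f0, C_star) <= alpha
)

print(C, alpha, prob(f1, C), best_power)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison of sizes is not affected by rounding.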


The test with $C=\{x:\ f_1(x)>kf_0(x)\}$ is often called the likelihood ratio
(LR) test or the NP test.

The NP Lemma (either with or without the proof) and its consequences 
are among the most popular Tripos topics in Cambridge University IB Statis- 
tics. 


Remark 4.2.1 (1) The statement of the NP Lemma remains valid if the
inequality $f_1(x)>kf_0(x)$ is replaced by $f_1(x)\ge kf_0(x)$.

(2) In practice, we have to solve an ‘inverse’ problem: for a given $\alpha\in(0,1)$
we want to find an MP test of size $\le\alpha$. That is, we want to construct $k$ as
a function of $\alpha$, not the other way around. In all the (carefully selected)
examples in this section, this is not a problem as the function $k\mapsto\alpha(k)$ is
one-to-one and admits a bona fide inversion $k=k(\alpha)$. Here, for a given $\alpha$
we can always find $k>0$ such that $P_0(C)=\alpha$ for $C=\{f_1(x)>kf_0(x)\}$ (and
finding such $k$ is a part of the task).


However, in many examples (especially related to the discrete case), the
NP test may not exist for every value of size $\alpha$. To circumvent this difficulty,
we have to consider more general randomised tests in which the decision
function $d$ may take not only values 0, 1, but also intermediate values from
$(0,1)$. Here, if the observed value is $x$, then we reject $H_0$ with probability
$d(x)$. (As a matter of fact, the ‘randomised’ NP Lemma guarantees that
there will always be an MP test in which $d$ takes at most three values: 0, 1
and possibly one value between.)

To proceed formally, we want first to extend the concept of the TIEP
and TIIEP. This is a straightforward generalisation of equations (4.1.1)
and (4.1.2):
$$P_0(d)=\int d(x)f_0(x)\,dx\ \text{ or }\ \sum_xd(x)f_0(x)\tag{4.2.3}$$
and
$$P_1(\overline d)=\int[1-d(x)]f_1(x)\,dx\ \text{ or }\ \sum_x[1-d(x)]f_1(x)\tag{4.2.4}$$
(here $\overline d$ stands for $1-d$). The power of the test with a decision function $d$ is
$$P_1(d)=\int d(x)f_1(x)\,dx\ \text{ or }\ \sum_xd(x)f_1(x).\tag{4.2.5}$$
Then, as before, we fix $\alpha\in(0,1)$ and consider all randomised tests $d$ of size
$\le\alpha$, looking for the one amongst them which maximises $P_1(d)$.

In the randomised version, the NP Lemma states:

For any pair of PDFs/PMFs $f_0$ and $f_1$ and $\alpha\in(0,1)$, there exist unique
$k>0$ and $p\in[0,1]$ such that the test of the form
$$d_{NP}(x)=\begin{cases}1,&f_1(x)>kf_0(x),\\ p,&f_1(x)=kf_0(x),\\ 0,&f_1(x)<kf_0(x),\end{cases}\tag{4.2.6}$$
has $P_0(d_{NP})=\alpha$. This test maximises the power among all randomised tests
of size $\alpha$:
$$P_1(d_{NP})=\max\,[P_1(d^*):\ P_0(d^*)\le\alpha].\tag{4.2.7}$$
That is, $d_{NP}$ is the MP randomised test of size $\le\alpha$ for $H_0$: $f=f_0$ against
$H_1$: $f=f_1$.
As before, the test $d_{NP}$ described in formula (4.2.6) is called the (randomised)
NP test. We want to stress once more that constant $p$ may (and
often does) coincide with 0 or 1, in which case $d_{NP}$ becomes non-randomised.
The proof of the randomised NP Lemma is somewhat longer than in the
non-randomised case, but still quite elegant. Assume first that we are in the
continuous case and work with PDFs $f_0$ and $f_1$ such that for all $k>0$
$$\int f_0(x)I(f_1(x)=kf_0(x))\,dx=0.\tag{4.2.8}$$
In this case value $p$ will be 0.
In fact, consider the function
$$G:\ k\mapsto\int f_0(x)I(f_1(x)>kf_0(x))\,dx.$$
Then $G(k)$ is monotone non-increasing in $k$, as the integration domain
shrinks with $k$. Further, function $G$ is right-continuous: $\lim_{r\searrow k+}G(r)=G(k)$
for all $k>0$. In fact, when $r\searrow k+$,
$$I(f_1(x)>rf_0(x))\nearrow I(f_1(x)>kf_0(x)),$$
as every point $x$ with $f_1(x)>kf_0(x)$ is eventually included in the (expanding)
domains $\{f_1(x)>rf_0(x)\}$. The convergence of the integrals is then a
corollary of standard theorems of analysis. Moreover, in a similar fashion
one proves that $G$ has left-side limits. That is,
$$G(k-)=\lim_{l\nearrow k-}G(l)$$
exists for all $k>0$ and equals
$$\int f_0(x)I(f_1(x)\ge kf_0(x))\,dx.$$
Under assumption (4.2.8), the difference
$$G(k-)-G(k)=\int f_0(x)I(f_1(x)=kf_0(x))\,dx$$
vanishes, and $G$ is continuous. Finally, we observe that
$$G(0+)=\lim_{k\to0+}G(k)=1,\quad G(\infty)=\lim_{k\to\infty}G(k)=0.\tag{4.2.9}$$
Hence $G$ crosses every level $\alpha\in(0,1)$ at some point $k=k(\alpha)$ (possibly not
unique). Then the (non-randomised) test with
$$d_{NP}(x)=\begin{cases}1,&f_1(x)>kf_0(x),\\ 0,&f_1(x)\le kf_0(x)\end{cases}$$
has size $P_0(d)=G(k)=\alpha$ and fits formula (4.2.6) with $p=0$.
This is the MP test of significance level $\alpha$. In fact, let $d^*$ be any other
(possibly randomised) test of size $P_0(d^*)\le\alpha$. We again have that for all $x$
$$0\le[d_{NP}(x)-d^*(x)]\,[f_1(x)-kf_0(x)],\tag{4.2.10}$$
since if $d_{NP}(x)=1$, then both brackets are $\ge0$ and if $d_{NP}(x)=0$, they are
both $\le0$. Hence,
$$0\le\int[d_{NP}(x)-d^*(x)]\,[f_1(x)-k(\alpha)f_0(x)]\,dx,$$
but the RHS again equals $P_1(d)-P_1(d^*)-k[P_0(d)-P_0(d^*)]$. This implies that
$$P_1(d)-P_1(d^*)\ge k[P_0(d)-P_0(d^*)],\ \text{ i.e. }\ P_1(d)\ge P_1(d^*)$$
if $P_0(d)\ge P_0(d^*)$.
We are now prepared to include the general case, without assumption
(4.2.8). Again consider the function
$$G:\ k\mapsto\int f_0(x)I(f_1(x)>kf_0(x))\,dx\ \text{ or }\ \sum_xf_0(x)I(f_1(x)>kf_0(x)).\tag{4.2.11}$$
It is still monotone non-increasing in $k$, right-continuous and with left-side
limits. (There exists a convenient French term ‘càdlàg’ (continue à droite,
limites à gauche).) Also, equation (4.2.9) holds true. However, the difference
$G(k-)-G(k)$ may be positive, as it equals the integral or the sum
$$\int f_0(x)I(f_1(x)=kf_0(x))\,dx\ \text{ or }\ \sum_xf_0(x)I(f_1(x)=kf_0(x))$$
that do not necessarily vanish. It means that, given $\alpha\in(0,1)$, we can only
guarantee that there exists $k=k(\alpha)>0$ such that
$$G(k-)\ge\alpha\ge G(k).$$
If $G(k-)=G(k)$ (i.e. $G$ is continuous at $k$), the previous analysis is applicable.
Otherwise, set
$$p=\frac{\alpha-G(k)}{G(k-)-G(k)}.\tag{4.2.12}$$
Then $p\in[0,1]$, and we can define the test $d_{NP}$ by formula (4.2.6). (If $p=0$
or 1, $d_{NP}$ is non-randomised.) See Figure 4.1.

The size
$$P_0(d_{NP})=\int f_0(x)d_{NP}(x)\,dx$$
equals
$$\int f_0(x)I(f_1(x)>kf_0(x))\,dx+p\int f_0(x)I(f_1(x)=kf_0(x))\,dx=G(k)+\frac{\alpha-G(k)}{G(k-)-G(k)}\,(G(k-)-G(k))=\alpha.$$
It remains to check that $d_{NP}$ is the MP test of size $\le\alpha$. This is done as
before, as inequality (4.2.10) still holds.
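
The construction of $k$ and $p$ can be followed step by step in a discrete example. The sketch below assumes, purely for illustration, a single observation $X\sim\mathrm{Bin}(5,\theta)$ with $H_0$: $\theta=1/2$ against $H_1$: $\theta=7/10$ and $\alpha=1/20$; none of these values comes from the text. It searches the attained values of $f_1/f_0$ for a point $k$ with $G(k-)\ge\alpha\ge G(k)$, computes $p$ as in (4.2.12), and verifies that the randomised test has size exactly $\alpha$.

```python
from fractions import Fraction as F
from math import comb

n = 5
theta0, theta1 = F(1, 2), F(7, 10)   # hypothetical simple null and alternative
alpha = F(1, 20)


def pmf(theta):
    """Binomial(n, theta) PMF as a list of exact fractions."""
    return [comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)]


f0, f1 = pmf(theta0), pmf(theta1)
ratio = [f1[x] / f0[x] for x in range(n + 1)]


def G(k, strict=True):
    """G(k) = P_0(f1 > k f0) if strict, else the left limit G(k-) = P_0(f1 >= k f0)."""
    return sum(f0[x] for x in range(n + 1)
               if (ratio[x] > k if strict else ratio[x] >= k))


# G jumps only at attained ratio values, so look among them
# for a point k with G(k-) >= alpha >= G(k).
k = next(r for r in sorted(set(ratio)) if G(r, strict=False) >= alpha >= G(r))
jump = G(k, strict=False) - G(k)
p = (alpha - G(k)) / jump if jump > 0 else F(0)


def d_np(x):
    """Randomised NP decision function of the form (4.2.6)."""
    if ratio[x] > k:
        return F(1)
    return p if ratio[x] == k else F(0)


size = sum(d_np(x) * f0[x] for x in range(n + 1))
power = sum(d_np(x) * f1[x] for x in range(n + 1))
print(k, p, size, power)
```

In this example the test rejects outright when $x=5$ and with probability $p$ when $x=4$; a non-randomised test could only achieve sizes $1/32$ or $6/32$ near the target level.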

There is a useful corollary of the randomised NP Lemma:

If $d(x)$ is an MP test of size $\alpha=P_0(d)$, then its power $\beta=P_1(d)$ cannot
be less than $\alpha$.


Proof In fact, consider a (randomised) test with $d^*(x)\equiv\alpha$. It has $P_0(d^*)=\alpha=P_1(d^*)$.
As $d$ is MP, $\beta\ge P_1(d^*)=\alpha$.

[Figure 4.1: graph of $G(k)$]


NP tests work for some problems with composite (i.e. not simple) null
hypotheses and alternatives. A typical example of composite hypotheses is
where $\Theta=\mathbb R$, $H_0$: $\theta\le\theta_0$ and $H_1$: $\theta>\theta_0$ for some given $\theta_0$. Quite often one
of $H_0$ and $H_1$ is simple (e.g. $H_0$: $\theta=\theta_0$) and the other composite ($\theta>\theta_0$ or
$\theta<\theta_0$ or $\theta\ne\theta_0$).

A general pair of composite hypotheses is $H_0$: $\theta\in\Theta_0$ against $H_1$: $\theta\in\Theta_1$,
where $\Theta_0$ and $\Theta_1$ are disjoint parts of the parameter set $\Theta$. To construct a
test, we again design a critical region $C\subset\mathbb R^n$ such that $H_0$ is rejected when
$\mathbf x\in C$. As in the case of simple hypotheses, the probability
$$P_\theta(C)=P_\theta(\text{reject }H_0),\quad\theta\in\Theta_0,\tag{4.2.13}$$
is treated as TIEP. Similarly, the probability
$$P_\theta(\overline C)=P_\theta(\text{accept }H_0),\quad\theta\in\Theta_1,\tag{4.2.14}$$
is treated as the TIIEP, while
$$P_\theta(C)=P_\theta(\text{reject }H_0),\quad\theta\in\Theta_1,\tag{4.2.15}$$
is treated as the power of the test with critical region $C$. Here, they all are
functions of $\theta$, either on $\Theta_0$ or on $\Theta_1$. (In fact, equations (4.2.13) and (4.2.15)
specify the same function $\theta\mapsto P_\theta(C)$ considered on a pair of complementary
sets $\Theta_0$ and $\Theta_1$.)
To state the problem, we adopt the same idea as before. Namely, we fix
a significance level $\alpha\in(0,1)$ and look for a test (i.e. a critical region) such
that: (i)
$$P_\theta(C)\le\alpha,\quad\theta\in\Theta_0.\tag{4.2.16}$$
(ii) For every test with critical region $C^*$ satisfying condition (4.2.16),
$$P_\theta(C)\ge P_\theta(C^*),\quad\theta\in\Theta_1.\tag{4.2.17}$$

Such a test is called uniformly most powerful (UMP) of level $\alpha$ (for testing
$H_0$: $\theta\in\Theta_0$ against $H_1$: $\theta\in\Theta_1$).

An important related concept is a family of PDFs/PMFs $f(\,\cdot\,;\theta)$, $\theta\in\Theta\subseteq\mathbb R$,
with a monotone likelihood ratio (MLR). Here, we require that for
any pair $\theta_1,\theta_2\in\Theta$ with $\theta_1<\theta_2$,
$$\Lambda_{\theta_2,\theta_1}(\mathbf x)=\frac{f(\mathbf x;\theta_2)}{f(\mathbf x;\theta_1)}=g_{\theta_2,\theta_1}(T(\mathbf x)),\tag{4.2.18}$$
where $T$ is a real-valued statistic (i.e. a scalar function depending on $\mathbf x$ only)
and $g_{\theta_2,\theta_1}(y)$ is a monotone non-decreasing function ($g_{\theta_2,\theta_1}(y)\le g_{\theta_2,\theta_1}(y')$
when $y\le y'$). In this definition, we can also require that $g_{\theta_2,\theta_1}(y)$ is a monotone
non-increasing function of $y$; what is important is that the direction of
monotonicity is the same for any $\theta_1<\theta_2$.


Example 4.2.2 A hypergeometric distribution Hyp$(N,D,n)$. You have a
stack of $N$ items and select $n$ of them at random, $n\le N$, for a check, without
replacement. If the stack contains $D\le N$ defective items, the number of
defects in the selected sample has the PMF
$$f_D(x)=\frac{\binom Dx\binom{N-D}{n-x}}{\binom Nn},\quad x=\max[0,n+D-N],\ldots,\min[D,n].$$
Here, $\theta=D$. The ratio
$$\frac{f_{D+1}(x)}{f_D(x)}=\frac{D+1}{N-D}\cdot\frac{N-D-n+x}{D+1-x}\tag{4.2.19}$$
is monotone increasing in $x$; hence we have an MLR family, with $T(x)=x$.
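
Formula (4.2.19) and the monotonicity claim are easy to confirm numerically. The sketch below uses hypothetical values $N=20$, $D=6$, $n=8$ (assumptions for illustration only) and exact rational arithmetic; the helper `hyp_pmf` is likewise an illustrative name.

```python
from fractions import Fraction as F
from math import comb


def hyp_pmf(N, D, n, x):
    """PMF of Hyp(N, D, n) at x (zero outside the range of the distribution)."""
    if x < max(0, n + D - N) or x > min(D, n):
        return F(0)
    return F(comb(D, x) * comb(N - D, n - x), comb(N, n))


N, D, n = 20, 6, 8   # hypothetical stack size, number of defectives, sample size

# Values of x at which both f_D(x) and f_{D+1}(x) are positive.
xs = range(max(0, n + D + 1 - N), min(D, n) + 1)

ratios = []
for x in xs:
    lhs = hyp_pmf(N, D + 1, n, x) / hyp_pmf(N, D, n, x)
    rhs = F(D + 1, N - D) * F(N - D - n + x, D + 1 - x)
    assert lhs == rhs            # agrees with formula (4.2.19)
    ratios.append(lhs)

is_increasing = all(r1 < r2 for r1, r2 in zip(ratios, ratios[1:]))
print(is_increasing)
```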

The hypergeometric distribution has a number of interesting properties
and is used in several areas of theoretical and applied probability and statistics.
In this book it appears only in this example. However, we give below a
useful equation for the PGF $Es^X$ of an RV $X\sim$ Hyp$(N,D,n)$:
$$Es^X=\sum_{k=0}^{\min[n,D]}\frac{(-D)_k(-n)_k}{(-N)_k}\,\frac{(1-s)^k}{k!}.$$
In this formula, $0\le\max[n,D]\le N$, and $(a)_k$ is the so-called Pochhammer
symbol: $(a)_k=a(a+1)\cdots(a+k-1)$. The series on the RHS can be written
as ${}_2F_1(-D,-n;-N;1-s)$, or ${}_2F_1\left(\begin{smallmatrix}-D,\,-n\\ -N\end{smallmatrix};1-s\right)$. Here ${}_2F_1$ is the so-called
Gauss hypergeometric function. In [63], a different (and elegant) recipe is
given for calculating the PGF of Hyp$(N,D,n)$.

Amazingly, in many books the range of the hypergeometric distribution
(i.e. the set of values of $x$ for which $f_D(x)>0$) is presented in a rather
confusing (or perhaps, amusing) fashion. In particular, the left-hand end
$(n+D-N)_+=\max[0,n+D-N]$ is often not even mentioned (or worse,
replaced by 0). In other cases, the left- and right-hand ends finally appear
after an argument suggesting that these values have been learned only in
the course of writing (which is obviously the wrong impression).


Example 4.2.3 A binomial distribution Bin$(n,\theta)$. Here, you put the
checked item back in the stack. Then
$$f(x;\theta)=\binom nx\theta^x(1-\theta)^{n-x},\quad x=0,1,\ldots,n,$$
where $\theta=D/N\in[0,1]$. This is an MLR family, again with $T(x)=x$.

Example 4.2.4 In general, any family of PDFs/PMFs of the form
$$f(x;\theta)=C(\theta)H(x)\exp[Q(\theta)R(x)],$$
where $Q$ is a strictly increasing or strictly decreasing function of $\theta$, has an
MLR. Here,
$$\Lambda_{\theta_2,\theta_1}(\mathbf x)=\frac{C(\theta_2)^n}{C(\theta_1)^n}\exp\Big((Q(\theta_2)-Q(\theta_1))\sum_iR(x_i)\Big),$$
which is monotone in $T(\mathbf x)=\sum_iR(x_i)$. In particular, the family of normal
PDFs with a fixed variance has an MLR (with respect to $\theta=\mu\in\mathbb R$)
as well as the family of normal PDFs with a fixed mean (with respect to
$\theta=\sigma^2>0$). Another example is the family of exponential PDFs
$f(x;\theta)=\theta e^{-\theta x}I(x>0)$, $\theta>0$.


Example 4.2.5 Let $X$ be an RV and consider the null hypothesis $H_0$:
$X\sim N(0,1)$ against $H_1$: $X\sim f(x)=\frac14e^{-|x|/2}$ (double exponential). We are
interested in the MP test of size $\alpha$, $\alpha<0.3$.

Here we have
$$f(x|H_1)/f(x|H_0)=C\exp\Big[\frac12\big(x^2-|x|\big)\Big].$$
So $f(x|H_1)/f(x|H_0)>K$ iff $x^2-|x|>K'$ iff $|x|>t$ or $|x|<1-t$, for some
$t>1/2$.
We want $\alpha=P_{H_0}(|X|>t)+P_{H_0}(|X|<1-t)$ to be $<0.3$. If $t\le1$,
$$P_{H_0}(|X|>t)\ge P_{H_0}(|X|>1)=0.3174>\alpha.$$
So, we must have $t>1$ to get
$$\alpha=P_{H_0}(|X|>t)=\Phi(-t)+1-\Phi(t).$$
So, $t=\Phi^{-1}(1-\alpha/2)$, and the test rejects $H_0$ if $|X|>t$. The power is
$$P_{H_1}(|X|>t)=1-\frac14\int_{-t}^te^{-|x|/2}\,dx=e^{-t/2}.$$
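
As a numerical illustration of Example 4.2.5, the sketch below evaluates the threshold and the power for one particular level. It assumes that the alternative density is read as $f(x)=\frac14e^{-|x|/2}$, which is the reading consistent with the stated power $e^{-t/2}$; the choice $\alpha=0.05$ is likewise an assumption made only for the illustration.

```python
from math import exp
from statistics import NormalDist

alpha = 0.05            # illustrative level; the example covers any alpha < 0.3
std_normal = NormalDist()

t = std_normal.inv_cdf(1 - alpha / 2)   # t = Phi^{-1}(1 - alpha/2)
size = 2 * (1 - std_normal.cdf(t))      # P_{H0}(|X| > t)
power = exp(-t / 2)                     # P_{H1}(|X| > t) when f(x) = exp(-|x|/2)/4

# The inner region |x| < 1 - t is empty because t > 1.
assert t > 1
print(f"t = {t:.3f}, size = {size:.3f}, power = {power:.3f}")
```

The power exceeds the size, in agreement with the corollary of the randomised NP Lemma stated earlier.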
The following theorem extends the NP Lemma to the case of families with
an MLR, for the one-side null hypothesis $H_0$: $\theta\le\theta_0$ against the one-side
alternative $H_1$: $\theta>\theta_0$:

Let $f(x;\theta)$, $\theta\in\mathbb R$, be an MLR family of PDFs/PMFs. Fix $\theta_0\in\mathbb R$ and
$k>0$ and choose any $\theta_1>\theta_0$. Then the test of $H_0$ against $H_1$, with the
critical region
$$C=\{x:\ f(x;\theta_1)>kf(x;\theta_0)\},\tag{4.2.20}$$
is UMP of significance level $\alpha=P_{\theta_0}(C)$.

This statement looks rather surprising because the role of value $\theta_1$ is not
clear. In fact, as we will see, $\theta_1$ is needed only to specify the critical region
(and consequently, size $\alpha$). More precisely, owing to the MLR, $C$ will be
written in the form
$$\{f(x;\theta')>k(\theta')f(x;\theta_0)\}$$
for all $\theta'>\theta_0$, for a suitably chosen $k(\theta')$. This fact will be crucial for the
proof.


Proof Without loss of generality, assume that all functions $g_{\theta_2,\theta_1}$ in the
definition of the MLR are monotone non-decreasing. First, owing to the NP
Lemma, the test (4.2.20) is MP of size $\le\alpha=P_{\theta_0}(C)$ for the simple null
hypothesis $f=f(\,\cdot\,;\theta_0)$ against the simple alternative $f=f(\,\cdot\,;\theta_1)$. That is,
$$P_{\theta_1}(C)\ge P_{\theta_1}(C^*)\ \text{ for all }C^*\text{ with }P_{\theta_0}(C^*)\le\alpha.\tag{4.2.21}$$
By using the MLR property, test (4.2.20) is equivalent to
$$C=\{x:\ T(x)>c\}$$
for some value $c$. But then, again by the MLR property, for all $\theta'>\theta_0$, test
(4.2.20) is equivalent to
$$C=\{f(x;\theta')>k(\theta')f(x;\theta_0)\},\tag{4.2.22}$$
for some value $k(\theta')$. Now, again owing to the NP Lemma, test (4.2.22) (and
hence test (4.2.20)) is MP of size $\le P_{\theta_0}(C)=\alpha$ for the simple null hypothesis
$f=f(\,\cdot\,;\theta_0)$ against the simple alternative $f=f(\,\cdot\,;\theta')$. That is,
$$P_{\theta'}(C)\ge P_{\theta'}(C^*)\ \text{ for all }C^*\text{ with }P_{\theta_0}(C^*)\le\alpha.$$
In other words, we have established that test (4.2.20) is UMP of significance
level $\alpha$, for the simple null hypothesis $f=f(\,\cdot\,;\theta_0)$ against the
one-side alternative that $f=f(\,\cdot\,;\theta')$ for some $\theta'>\theta_0$. Formally,
$$P_{\theta'}(C)\ge P_{\theta'}(C^*)\ \text{ whenever }\theta'>\theta_0\text{ and }P_{\theta_0}(C^*)\le\alpha.$$
But then the same inequality will hold under the additional restriction (on
$C^*$) that $P_\theta(C^*)\le\alpha$ for all $\theta\le\theta_0$. The last fact means precisely that test
(4.2.22) (and hence test (4.2.20)) gives a UMP test of significance level $\alpha$
for $H_0$: $\theta\le\theta_0$ versus $H_1$: $\theta>\theta_0$.


Analysing the proof, one can see that the same assertion holds when
$H_0$ is $\theta<\theta_0$ and $H_1$: $\theta\ge\theta_0$. As in the case of simple hypotheses, the
inverse problem (to find constants $k(\theta')$, $\theta'>\theta_0$, for a given $\alpha\in(0,1)$)
requires randomisation of the test. The corresponding assertion guarantees
the existence of a randomised UMP test with at most three values of the
decision function.


4.3 Goodness of fit. Testing normal distributions, 
1: homogeneous samples 


Fit, Man, Test, Woman. 
(From the series ‘Movies that never made it to the Big Screen’.) 


The NP Lemma and its modifications are rather exceptional examples in
which the problem of hypothesis testing can be efficiently solved. Another
collection of (practically important) examples is where RVs are normally
distributed. Here, the hypotheses testing can be successfully done (albeit in
a somewhat incomplete formulation). This is based on the Fisher Theorem
that if $X_1,\ldots,X_n$ is a sample of IID RVs $X_i\sim N(\mu,\sigma^2)$, then
$$\frac{\sqrt n}{\sigma}(\overline X-\mu)\sim N(0,1)\ \text{ and }\ \frac1{\sigma^2}S_{XX}\sim\chi^2_{n-1},\ \text{ independently.}$$
Here $\overline X=\sum_iX_i/n$ and $S_{XX}=\sum_i(X_i-\overline X)^2$. See Section 3.5.
A sample
$$\mathbf X=\begin{pmatrix}X_1\\ \vdots\\ X_n\end{pmatrix},\ \text{ with }X_i\sim N(\mu,\sigma^2),$$
is called homogeneous normal; a non-homogeneous normal sample is where
$X_i\sim N(\mu_i,\sigma_i^2)$, i.e. the parameters of the distribution of $X_i$ vary with $i$.
One also says that this is a single sample (normal) case.

Testing a given mean, unknown variance. Consider a homogeneous
normal sample $\mathbf X$ and the null hypothesis $H_0$: $\mu=\mu_0$ against $H_1$: $\mu\ne\mu_0$.
Here $\mu_0$ is a given value. Our test will be based on Student’s t-statistic
$$T(\mathbf X)=\frac{\sqrt n(\overline X-\mu_0)}{\sqrt{S_{XX}/(n-1)}}=\frac{\sqrt n(\overline X-\mu_0)}{s_{XX}},$$
where $s_{XX}$ is the sample standard deviation (see Worked Example 1.4.8).
According to the definition of the t distribution in Example 3.1.3 (see equation
(3.1.6)), $T(\mathbf X)\sim t_{n-1}$ under $H_0$. A remarkable fact here is that calculating
$T(\mathbf x)$ does not require knowledge of $\sigma^2$. Therefore, the test will work
regardless of whether or not we know $\sigma^2$.

Hence, a natural conclusion is that if under $H_0$ the absolute value $|T(\mathbf x)|$
of t-statistic $T(\mathbf x)$ is large, then $H_0$ is to be rejected. More precisely, given
$\gamma\in(0,1)$, we will denote by $t_{n-1}(\gamma)$ the upper $\gamma$ point (quantile) of the $t_{n-1}$
distribution, defined as the value $a$ for which
$$\int_a^\infty f_{t_{n-1}}(x)\,dx=\gamma$$
(the lower $\gamma$ point is of course $-t_{n-1}(\gamma)$). Then we reject $H_0$ when, with
$\mu=\mu_0$, we have $|T(\mathbf x)|>t_{n-1}(\alpha/2)$.

This routine is called the Student, or t-test (the t distribution is also often
called the Student distribution). It was proposed in 1908 by W.S. Gosset
(1876-1937), a UK statistician who worked both in Ireland and England and
wrote under the pen name ‘Student’. Gosset’s job was with the Guinness
Brewery, where he was in charge of experimental brewing (he was educated
as a chemist). In this capacity he spent an academic year in K. Pearson’s
Biometric Laboratory in London, where he learned Statistics. Gosset was
known as a meek and shy man; there was a joke that he was the only person 
who managed to be friendly with both K. Pearson and Fisher at the same 
time. (Fisher was not only famous for his research results but also renowned 
as a very irascible personality. When he was confronted (or even mildly 
asked) about inconsistencies or obscurities in his writings and sayings he 
often got angry and left the audience. Once Tukey (who was himself famous 
for his ‘brashness’) came to his office and began a discussion about points 
made by Fisher. After five minutes the conversation became heated, and 
Fisher said: ‘All right. You can leave my office now.’ Tukey said ‘No, I won’t 
do that as I respect you too much.’ ‘In that case’, Fisher replied, ‘I’ll leave’.) 

Returning to the t-test: one may ask what to do when $|t|\le t_{n-1}(\alpha/2)$?
The answer is rather diplomatic: you then don’t reject $H_0$ at significance
level $\alpha$ (as it is still treated as a conservative hypothesis).

The t-statistic can also be used to construct a confidence interval (CI) for
$\mu$ (again with or without knowing $\sigma^2$). In fact, in the equation
$$P\left(-t_{n-1}(\alpha/2)<\frac{\sqrt n(\overline X-\mu)}{s_{XX}}<t_{n-1}(\alpha/2)\right)=1-\alpha$$
the inequalities can be imposed on $\mu$:
$$P\left(\overline X-\frac1{\sqrt n}s_{XX}t_{n-1}(\alpha/2)<\mu<\overline X+\frac1{\sqrt n}s_{XX}t_{n-1}(\alpha/2)\right)=1-\alpha.$$
Here $P$ stands for $P_{\mu,\sigma^2}$, the distribution of the IID sample with
$X_i\sim N(\mu,\sigma^2)$. This means that a $100(1-\alpha)$ percent equal-tailed CI for $\mu$ is
$$\left(\overline X-\frac1{\sqrt n}s_{XX}t_{n-1}(\alpha/2),\ \overline X+\frac1{\sqrt n}s_{XX}t_{n-1}(\alpha/2)\right).\tag{4.3.2}$$


Some statisticians don’t drink because they are t-test totallers. 
(From the series ‘Why they are misunderstood’. ) 


Example 4.3.1 The durability of two materials a and b used for soles of
ladies’ shoes is to be tested. A paired design is proposed, where each of
10 volunteers had one sole made of a and one of b. The wear (in suitable
units) is shown in the table below.

Volunteer            1     2     3     4     5     6     7     8     9    10
a                 14.7   9.7  11.3  14.9  11.9   7.2   9.6  11.7   9.7  14.0
b                 14.4   9.2  10.8  14.6  12.1   6.1   9.7  11.4   9.1  13.2
Difference X_i     0.3   0.5   0.5   0.3  -0.2   1.1  -0.1   0.3   0.6   0.8

Assuming that differences $X_1,\ldots,X_{10}$ are IID $N(\mu,\sigma^2)$, with $\mu$ and $\sigma^2$
unknown, one tests $H_0$: $\mu=0$ against $H_1$: $\mu\ne0$. Here $\overline x=0.41$,
$S_{xx}=\sum_ix_i^2-n\overline x^2=1.349$ and the t-statistic
$$t=\sqrt{10}\times0.41\Big/\sqrt{1.349/9}=3.35.$$
In a size 0.05 test, $H_0$ is rejected as $t_9(0.025)=2.262<3.35$, and one
concludes that there is a difference between the mean wear of a and b. This
is a paired samples t-test. A 95 percent CI for the mean difference is
$$\left(0.41-\frac{\sqrt{1.349/9}\times2.262}{\sqrt{10}},\ 0.41+\frac{\sqrt{1.349/9}\times2.262}{\sqrt{10}}\right)=(0.133,\,0.687).$$
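
The arithmetic of Example 4.3.1 can be retraced in a few lines of Python. This is a sketch using only the standard library; the critical value $t_9(0.025)=2.262$ is taken from the example as a tabulated constant rather than computed.

```python
from math import sqrt

a = [14.7, 9.7, 11.3, 14.9, 11.9, 7.2, 9.6, 11.7, 9.7, 14.0]
b = [14.4, 9.2, 10.8, 14.6, 12.1, 6.1, 9.7, 11.4, 9.1, 13.2]
x = [ai - bi for ai, bi in zip(a, b)]      # paired differences

n = len(x)
mean = sum(x) / n
s_xx = sum((xi - mean) ** 2 for xi in x)   # equals sum x_i^2 - n * mean^2
s = sqrt(s_xx / (n - 1))                   # sample standard deviation
t_stat = sqrt(n) * mean / s                # t-statistic for H_0: mu = 0

t_crit = 2.262                             # t_9(0.025), tabulated value from the example
half_width = s * t_crit / sqrt(n)
ci = (mean - half_width, mean + half_width)

print(round(mean, 2), round(s_xx, 3), round(t_stat, 2))
print(tuple(round(v, 3) for v in ci), abs(t_stat) > t_crit)
```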
Historically, the invention of the t-test was an important point in the
development of the subject of statistics. As a matter of fact, testing a similar
hypothesis about the variance of the normal distribution is a simpler task,
as we can use statistic $S_{XX}$ only.


Testing a given variance, unknown mean. We again take a homogeneous
normal sample $\mathbf X$ and consider the null hypothesis $H_0$: $\sigma^2=\sigma_0^2$ against
$H_1$: $\sigma^2\ne\sigma_0^2$, where $\sigma_0^2$ is a given value. As was said above, the test is based
on the statistic
$$\frac1{\sigma_0^2}S_{XX}=\frac1{\sigma_0^2}\sum_i(X_i-\overline X)^2,\tag{4.3.3}$$
which is $\sim\chi^2_{n-1}$ under $H_0$. Hence, given $\alpha\in(0,1)$, we reject $H_0$ in an
equal-tailed two-sided test of level $\alpha$ when the value $S_{xx}/\sigma_0^2$ is either less
than $h^-_{n-1}(\alpha/2)$ (which would favour $\sigma^2<\sigma_0^2$) or greater than $h^+_{n-1}(\alpha/2)$
(in which case $\sigma^2$ is probably $>\sigma_0^2$). Here and in what follows $h^+_m(\gamma)$ stands
for the upper $\gamma$ point (quantile) of $\chi^2_m$, i.e. the value of $a$ such that
$$\int_a^\infty f_{\chi^2_m}(x)\,dx=\gamma.$$
Similarly, $h^-_m(\gamma)$ is the lower $\gamma$ point (quantile), i.e. the value of $a$ for which
$$\int_0^af_{\chi^2_m}(x)\,dx=\gamma;$$
as was noted before, $h^-_m(\gamma)=h^+_m(1-\gamma)$, $0<\gamma<1$. By $f_{\chi^2_m}(x)$ we denote
here and below the $\chi^2$ PDF with $m$ degrees of freedom.

This test is called the normal $\chi^2$ test. It works without any reference to
$\mu$ which may be known or unknown.

The normal $\chi^2$ test allows us to construct a CI for $\sigma^2$, regardless of whether
we know $\mu$ or not. Namely, we rewrite the equation
$$P\left(h^-_{n-1}(\alpha/2)<\frac1{\sigma^2}S_{XX}<h^+_{n-1}(\alpha/2)\right)=1-\alpha$$
as
$$P\left(\frac{S_{XX}}{h^+_{n-1}(\alpha/2)}<\sigma^2<\frac{S_{XX}}{h^-_{n-1}(\alpha/2)}\right)=1-\alpha$$
(here $P$ is $P_{\mu,\sigma^2}$, the sample distribution with $X_i\sim N(\mu,\sigma^2)$). Then, clearly,
a $100(1-\alpha)$ percent equal-tailed CI for $\sigma^2$ is
$$\left(\frac{S_{XX}}{h^+_{n-1}(\alpha/2)},\ \frac{S_{XX}}{h^-_{n-1}(\alpha/2)}\right).\tag{4.3.4}$$
Example 4.3.2 At a new call centre, the manager wishes to ensure that callers do not wait too long for their calls to be answered. A sample of 30 calls at the busiest time of the day gives a mean waiting time of 8 seconds (judged acceptable). At the same time, the sample value of $S_{xx}/(n-1)$ is 16, which is considerably higher than 9, the value from records of other call centres. The manager tests $H_0$: $\sigma^2 = 9$ against $H_1$: $\sigma^2 > 9$, assuming that call waiting times are IID $N(\mu,\sigma^2)$ (an idealised model).

We use a $\chi^2$ test. The value $S_{xx}/\sigma_0^2 = 29 \times 16/9 = 51.56$. For $\alpha = 0.05$ and $n = 30$, the upper $\alpha$ point $h^+_{n-1}(\alpha)$ is 42.56. Hence, at the 5 percent level we reject $H_0$ and conclude that the variance at the new call centre is $> 9$.

The 95 percent CI for $\sigma^2$ is

$$\left(\frac{16\times 29}{45.72},\ \frac{16\times 29}{16.05}\right) = (10.15, 28.91),$$

as $h^+_{29}(0.025) = 45.72$ and $h^+_{29}(0.975) = 16.05$.
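
The numbers in this example can be reproduced with a few lines. This is only a sketch: the helper names are hypothetical and the chi-square quantiles (42.56, 45.72, 16.05) are the tabulated values quoted above, not computed here.

```python
def variance_test(sample_var, n, sigma0_sq, upper_quantile):
    """One-sided normal chi-square test of sigma^2 = sigma0_sq against sigma^2 > sigma0_sq."""
    statistic = (n - 1) * sample_var / sigma0_sq   # S_xx / sigma_0^2
    return statistic, statistic > upper_quantile

def variance_interval(sample_var, n, h_upper, h_lower):
    """Equal-tailed CI (S_xx / h_upper, S_xx / h_lower) for sigma^2."""
    s_xx = (n - 1) * sample_var
    return s_xx / h_upper, s_xx / h_lower

statistic, reject = variance_test(16.0, 30, 9.0, 42.56)
ci_low, ci_high = variance_interval(16.0, 30, 45.72, 16.05)
print(round(statistic, 2), reject, round(ci_low, 2), round(ci_high, 2))
```
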


Both the t-test and the normal $\chi^2$ test are examples of so-called goodness of fit tests. Here, we have a null hypothesis $H_0$ corresponding to a 'thin' subset $\Theta_0$ in the parameter set $\Theta$. In Examples 4.3.1 and 4.3.2 it was a half-line $\{\mu = \mu_0,\ \sigma^2 > 0 \text{ arbitrary}\}$ or a line $\{\mu \in \mathbb{R},\ \sigma^2 = \sigma_0^2\}$ embedded into the half-plane $\Theta = \{\mu \in \mathbb{R},\ \sigma^2 > 0\}$. The alternative was specified by the complement $\Theta_1 = \Theta \setminus \Theta_0$. We have to find a test statistic (in the examples, $T$ or $S_{XX}$), with a 'standard' distribution under $H_0$. Then we reject $H_0$ at a given significance level $\alpha$ when the value of the statistic does not lie in the high-probability domain specified for this $\alpha$. See Figure 4.2. Such a domain was the interval

$$\left(-t_{n-1}(\alpha/2),\ t_{n-1}(\alpha/2)\right)$$

for the t-test and

$$\left(h^-_{n-1}(\alpha/2),\ h^+_{n-1}(\alpha/2)\right)$$

for the $\chi^2$ test. In this case one says that the data are significant at level $\alpha$. Otherwise, the data at this level are declared insignificant, and $H_0$ is not rejected.

Figure 4.2


Remark 4.3.3 In this section we followed the tradition of disjoint null and alternative hypotheses ($\Theta_0 \cap \Theta_1 = \emptyset$). Beginning with the next section, the null hypothesis $H_0$ in a goodness of fit test will be specified by a 'thin' set of parameter values $\Theta_0$ while the alternative $H_1$ will correspond to a 'full' set $\Theta \supset \Theta_0$. This will not affect examples considered below as the answers will be the same. Also a number of problems in Chapter 5 associate the alternative with the complement $\Theta \setminus \Theta_0$ rather than with the whole set $\Theta$; again the answers are unaffected.


Insignificance is when you cut 10 snake heads 
and one of them survives. 
(From the series ‘Thus spoke Supertutor’.) 


4.4 The Pearson $\chi^2$ test. The Pearson Theorem


Statisticians do it with significance. 
(From the series ‘How they do it’.) 


Historically, the idea of the goodness of fit approach goes back to K. Pearson and dates to 1900. The idea was further developed in the 1920s within the framework of the so-called Pearson chi-square, or Pearson, or $\chi^2$, test based on the Pearson chi-square, or Pearson, or $\chi^2$, statistic. A feature of the Pearson test is that it is 'universal', in the sense that it can be used to check the hypothesis that $\mathbf{X}$ is a random sample from any given PMF/PDF. The hypothesis is rejected when the value of the Pearson statistic does not fall into the interval of highly probable values. The universality is manifested in the fact that the formula for the Pearson statistic and its distribution do not depend on the form of the tested PMF/PDF. However, the price paid is that the test is only asymptotically accurate, as $n$, the size of the sample, tends to $\infty$.

Suppose we test the hypothesis that an IID random sample $\mathbf{X} = (X_1,\ldots,X_n)^{\mathrm T}$ comes from a PDF/PMF $f^0$. We partition $\mathbb{R}$, the space of values for RVs $X_i$, into $k$ disjoint sets (say intervals) $D_1,\ldots,D_k$ and calculate the probabilities

$$p_l^0 = \int_{D_l} f^0(x)\,dx \quad\text{or}\quad \sum_{x\in D_l} f^0(x). \tag{4.4.1}$$

The null hypothesis $H_0$ is that for all $l$, the probability $P(X_i \in D_l)$ is equal to $p_l^0$, the value given in equation (4.4.1). The alternative is that they are unrestricted (apart from $p_l \geq 0$ and $p_1 + \cdots + p_k = 1$). The Pearson $\chi^2$ statistic is

$$P = \sum_{l=1}^k \frac{(n_l - e_l)^2}{e_l}, \tag{4.4.2}$$

where $n_l$ ($= n_l(\mathbf{x})$) is the number of values $x_i$ falling in $D_l$, with $n_1 + \cdots + n_k = n$, and $e_l = np_l^0$ is the expected number under $H_0$. Letter $P$ here is used to stress Pearson's pioneering contribution. Then, given $\alpha \in (0,1)$, we reject the null hypothesis at significance level $\alpha$ when the value $p$ of $P$ exceeds $h^+_{k-1}(\alpha)$, the upper $\alpha$ quantile of the $\chi^2_{k-1}$ distribution.

This routine is based on the Pearson Theorem:

Suppose that $X_1, X_2, \ldots$ is a sequence of IID RVs. Let $D_1,\ldots,D_k$ be a partition of $\mathbb{R}$ into pair-wise disjoint sets and set $q_l = P(X_i \in D_l)$, $l = 1,\ldots,k$, with $q_1 + \cdots + q_k = 1$. Next, for all $l = 1,\ldots,k$ and $n \geq 1$, define

$N_{l,n}$ = the number of RVs $X_i$ among $X_1,\ldots,X_n$ such that $X_i \in D_l$, with $N_{1,n} + \cdots + N_{k,n} = n$, and

$$P_n = \sum_{l=1}^k \frac{(N_{l,n} - nq_l)^2}{nq_l}. \tag{4.4.3}$$

Then for all $a > 0$:

$$\lim_{n\to\infty} P(P_n > a) = \int_a^\infty f_{\chi^2_{k-1}}(x)\,dx. \tag{4.4.4}$$


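Proof We use the fact that relation (4.4.4) is equivalent to the convergence of the characteristic functions $\psi_{P_n}(t) = E e^{itP_n}$:

$$\lim_{n\to\infty} \psi_{P_n}(t) = \int_0^\infty f_{\chi^2_{k-1}}(x)\,e^{itx}\,dx, \quad t \in \mathbb{R}. \tag{4.4.5}$$

Set

$$Y_{l,n} = \frac{N_{l,n} - nq_l}{\sqrt{nq_l}}, \quad\text{so that}\quad P_n = \sum_{l=1}^k Y_{l,n}^2 \tag{4.4.6}$$

and

$$\sum_{l=1}^k Y_{l,n}\sqrt{q_l} = \frac{1}{\sqrt{n}}\sum_{l=1}^k (N_{l,n} - nq_l) = \frac{1}{\sqrt{n}}\left(\sum_{l=1}^k N_{l,n} - n\sum_{l=1}^k q_l\right) = 0. \tag{4.4.7}$$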


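Our aim is to determine the limiting distribution of the random vector

$$\mathbf{Y}_n = (Y_{1,n},\ldots,Y_{k,n})^{\mathrm T}.$$

Take the unit vector

$$\kappa = (\sqrt{q_1},\ldots,\sqrt{q_k})^{\mathrm T}$$

and consider a real orthogonal $k \times k$ matrix $A$ with $k$th column $\kappa$. Such a matrix always exists: you simply complete vector $\kappa$ to an orthonormal basis in $\mathbb{R}^k$ and use this basis to form the columns of $A$. Consider the random vector

$$\mathbf{Z}_n = (Z_{1,n},\ldots,Z_{k,n})^{\mathrm T}$$

defined by

$$\mathbf{Z}_n = A^{\mathrm T}\mathbf{Y}_n.$$

Its last entry vanishes:

$$Z_{k,n} = \sum_{l=1}^k Y_{l,n}A_{l,k} = \sum_{l=1}^k Y_{l,n}\sqrt{q_l} = 0.$$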
l=1 l=1 


At the same time, owing to the orthogonality of $A$:

$$P_n = \sum_{l=1}^k Y_{l,n}^2 = \sum_{l=1}^{k-1} Z_{l,n}^2. \tag{4.4.8}$$

Equation (4.4.8) gives insight into the structure of RV $P_n$. We see that to prove limiting relation (4.4.4) it suffices to check that RVs $Z_{1,n},\ldots,Z_{k-1,n}$ become asymptotically independent $N(0,1)$. Again using the CHFs, it is enough to prove that the joint CHF converges to the product:

$$\lim_{n\to\infty} E e^{it_1Z_{1,n} + \cdots + it_{k-1}Z_{k-1,n}} = \prod_{l=1}^{k-1} e^{-t_l^2/2}. \tag{4.4.9}$$




To prove relation (4.4.9), we return to the RVs $N_{l,n}$. Write

$$P(N_{1,n} = n_1,\ldots,N_{k,n} = n_k) = \frac{n!}{n_1!\cdots n_k!}\,q_1^{n_1}\cdots q_k^{n_k},$$

for all non-negative integers $n_1,\ldots,n_k$ with $n_1 + \cdots + n_k = n$. Then the joint CHF

$$E e^{i\sum_l t_lN_{l,n}} = \sum_{n_1,\ldots,n_k:\ \sum_l n_l = n} \frac{n!}{n_1!\cdots n_k!}\,q_1^{n_1}\cdots q_k^{n_k}\,e^{i\sum_l t_ln_l} = \left(\sum_{l=1}^k q_le^{it_l}\right)^n.$$

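Passing to $Y_{1,n},\ldots,Y_{k,n}$ gives

$$E e^{i\sum_l t_lY_{l,n}} = e^{-i\sqrt{n}\sum_l t_l\sqrt{q_l}}\left(\sum_{l=1}^k q_le^{it_l/\sqrt{nq_l}}\right)^n,$$

and

$$\ln E e^{i\sum_l t_lY_{l,n}} = n\ln\left(\sum_{l=1}^k q_le^{it_l/\sqrt{nq_l}}\right) - i\sqrt{n}\sum_{l=1}^k t_l\sqrt{q_l}$$

$$= n\ln\left(1 + \frac{i}{\sqrt{n}}\sum_{l=1}^k t_l\sqrt{q_l} - \frac{1}{2n}\sum_{l=1}^k t_l^2 + O(n^{-3/2})\right) - i\sqrt{n}\sum_{l=1}^k t_l\sqrt{q_l}$$

$$= -\frac{1}{2}\sum_{l=1}^k t_l^2 + \frac{1}{2}\left(\sum_{l=1}^k t_l\sqrt{q_l}\right)^2 + O(n^{-1/2}),$$

by the Taylor expansion.

As $n \to \infty$, this converges to

$$-\frac{1}{2}\|t\|^2 + \frac{1}{2}(A^{\mathrm T}t)_k^2 = -\frac{1}{2}\|A^{\mathrm T}t\|^2 + \frac{1}{2}(A^{\mathrm T}t)_k^2 = -\frac{1}{2}\sum_{l=1}^{k-1}(A^{\mathrm T}t)_l^2, \quad t = (t_1,\ldots,t_k)^{\mathrm T} \in \mathbb{R}^k.$$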


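Here $A$ is the above $k \times k$ orthogonal matrix, with $A^{\mathrm T} = A^{-1}$, and $\|\cdot\|^2$ stands for the square of the norm (or length) of a vector from $\mathbb{R}^k$ (so that $\|t\|^2 = \sum_l t_l^2$ and $\|A^{\mathrm T}t\|^2 = \sum_l (A^{\mathrm T}t)_l^2$).
Consequently, with $\langle t, \mathbf{Y}_n\rangle = \sum_{l=1}^k t_lY_{l,n}$, we have

$$\lim_{n\to\infty} E e^{i\langle t,\mathbf{Y}_n\rangle} = \prod_{l=1}^{k-1} e^{-(A^{\mathrm T}t)_l^2/2}. \tag{4.4.10}$$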
Then, for RVs $Z_{l,n}$, in a similar notation

$$E e^{i\sum_l t_lZ_{l,n}} = E e^{i\langle t,\mathbf{Z}_n\rangle} = E e^{i\langle t,A^{\mathrm T}\mathbf{Y}_n\rangle} = E e^{i\langle At,\mathbf{Y}_n\rangle}$$

which, by (4.4.10) applied to the vector $At$, converges as $n \to \infty$ to

$$\prod_{l=1}^{k-1} e^{-(A^{\mathrm T}At)_l^2/2} = \prod_{l=1}^{k-1} e^{-t_l^2/2}.$$

This completes the proof.


We stress once again that the Pearson $\chi^2$ test becomes accurate only in the limit $n \to \infty$. However, as the convergence is quite fast, one uses this test for moderate values of $n$, quoting the Pearson Theorem as a ground.
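
How good the approximation is for a moderate $n$ can be probed by simulation. The sketch below is illustrative only: the cell probabilities, sample size, seed and helper names are all assumptions, and 7.815 is the tabulated upper 5 percent point of $\chi^2_3$.

```python
import random

def pearson_statistic(counts, probs):
    """P_n = sum over cells of (N_l - n q_l)^2 / (n q_l)."""
    n = sum(counts)
    return sum((c - n * q) ** 2 / (n * q) for c, q in zip(counts, probs))

def simulate_tail(probs, n, threshold, repeats, seed=0):
    """Monte Carlo estimate of P(P_n > threshold) under the null hypothesis."""
    rng = random.Random(seed)
    cells = range(len(probs))
    exceed = 0
    for _ in range(repeats):
        counts = [0] * len(probs)
        for cell in rng.choices(cells, weights=probs, k=n):
            counts[cell] += 1
        if pearson_statistic(counts, probs) > threshold:
            exceed += 1
    return exceed / repeats

# k = 4 cells, so the limiting law is chi-square with 3 degrees of freedom.
tail = simulate_tail([0.1, 0.2, 0.3, 0.4], n=200, threshold=7.815, repeats=2000)
print(tail)  # should be near the nominal 0.05 if the approximation is good
```
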

It is interesting to note that K. Pearson was also a considerable scholar in 
philosophy. His 1891 book ‘The Grammar of Science’ gave a lucid exposition 
of scientific philosophy of the Vienna school of the late nineteenth century 
and was critically reviewed in the works of Lenin (which both of the present 
authors had to learn during their university years). However, Lenin made a 
clear distinction between Pearson and Mach, the main representative of this 
philosophical direction (and a prominent physicist of the period) and clearly 
considered Pearson superior to Mach. 


Example 4.4.1 In his famous experiments Mendel crossed 556 smooth 
yellow male peas with wrinkled green female peas. From this progeny, we 
define: 

$N_1$ = the number of smooth yellow peas,

$N_2$ = the number of smooth green peas,

$N_3$ = the number of wrinkled yellow peas,

$N_4$ = the number of wrinkled green peas,

and we consider the null hypothesis for the proportions

$$H_0:\ (p_1, p_2, p_3, p_4) = \left(\frac{9}{16}, \frac{3}{16}, \frac{3}{16}, \frac{1}{16}\right),$$

as predicted by Mendel's theory. The alternative is that the $p_i$ are unrestricted (with $p_i \geq 0$ and $\sum_{i=1}^4 p_i = 1$).
It was observed that $(n_1, n_2, n_3, n_4) = (301, 121, 102, 32)$, with

$$e = (312.75, 104.25, 104.25, 34.75), \quad P = 3.39888,$$

and the upper 0.25 point $= h^+_3(0.25) = 4.10834$. We therefore have no reason to reject Mendel's predictions, even at the 25 percent level.
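
The expected counts and the value of the Pearson statistic for Mendel's data can be checked directly from formula (4.4.2). The function name below is hypothetical; the data and the hypothesised proportions are those of the example.

```python
def pearson_statistic(observed, probabilities):
    """Return the Pearson statistic and the expected counts e_l = n * p_l."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected

observed = [301, 121, 102, 32]
mendel = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
statistic, expected = pearson_statistic(observed, mendel)
print(expected)
print(round(statistic, 5))
```
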
Note that the above null hypothesis that $f = f^0$ specifies probabilities $p_l^0$ completely. In many cases, we have to follow a less precise null hypothesis that the $p_l^0$ belong to a given family $\{p_l^0(\theta),\ \theta \in \Theta\}$. In this situation we apply the previous routine after estimating the value of the parameter from the same sample. That is, the $\chi^2$ statistic is calculated as

$$P(\mathbf{x}) = \sum_{l=1}^k \frac{(n_l - \hat{e}_l)^2}{\hat{e}_l}, \tag{4.4.11}$$

where $\hat{e}_l = np_l^0(\hat\theta)$ and $\hat\theta$ ($= \hat\theta(\mathbf{x})$) is an estimator of $\theta$.

Usually, $\hat\theta$ is taken to be the MLE. Then the value $\pi = \sum_{l=1}^k (n_l - \hat{e}_l)^2/\hat{e}_l$ of statistic $P$ is compared not with $h^+_{k-1}(\alpha)$ but with $h^+_{k-1-|\Theta|}(\alpha)$, where $|\Theta|$ is the dimension of set $\Theta$. That is, at significance level $\alpha \in (0,1)$ we reject $H_0$: $p_l^0$ belong to family $\{p_l^0(\theta),\ \theta \in \Theta\}$ when $\pi > h^+_{k-1-|\Theta|}(\alpha)$. (Typically, the value $h^+_{k_1}(\alpha)$ is higher than $h^+_{k_2}(\alpha)$ when $k_1 > k_2$.) This is based on a modified Pearson Theorem similar to the one above.


4.5 Generalised likelihood ratio tests. The Wilks Theorem 


Silence of the Lambdas. 
(From the series ‘Movies that never made it to the Big Screen’.) 


The idea of using the MLEs for calculating a test statistic is pushed further forward when one discusses the so-called generalised likelihood ratio test. Here, one considers a null hypothesis $H_0$ that $\theta \in \Theta_0$, where $\Theta_0 \subset \Theta$, and the general alternative $H_1$: $\theta \in \Theta$. A particular example is the case of $H_0$: $f = f^0$, where probabilities $p_l^0 = P(D_l)$ have been completely specified; in this case $\Theta$ is the set of all PMFs $(p_1,\ldots,p_k)$ and $\Theta_0$ is reduced to a single point $(p_1^0,\ldots,p_k^0)$. The case where $p_l^0$ depend on parameter $\theta$ is a more general example. Now we adopt a similar course of action: consider the maxima

$$\max\,[f(\mathbf{x};\theta):\ \theta \in \Theta_0]$$

and

$$\max\,[f(\mathbf{x};\theta):\ \theta \in \Theta]$$

and take their ratio (which is $\geq 1$):

$$\Lambda_{H_1;H_0}(\mathbf{x}) = \frac{\max\,[f(\mathbf{x};\theta):\ \theta \in \Theta]}{\max\,[f(\mathbf{x};\theta):\ \theta \in \Theta_0]}. \tag{4.5.1}$$
$\Lambda_{H_1;H_0}$ is called the generalised likelihood ratio (GLR) for $H_0$ and $H_1$; sometimes the denominator is called the likelihood of $H_0$ and the numerator the likelihood of $H_1$.

In some cases, the GLR $\Lambda_{H_1;H_0}$ has a recognised distribution. In general, one takes

$$R = 2\ln\Lambda_{H_1;H_0}(\mathbf{X}). \tag{4.5.2}$$

Then if, for a given $\alpha \in (0,1)$, the value of $R$ in formula (4.5.2) exceeds the upper point $h^+_p(\alpha)$, $H_0$ is rejected at level $\alpha$. Here $p$, the number of degrees of freedom of the $\chi^2$ distribution, equals $|\Theta| - |\Theta_0|$, the difference of the dimensions of sets $\Theta$ and $\Theta_0$.

This routine is called the generalised likelihood ratio test (GLRT) and is based on the Wilks Theorem, which is a generalisation of the Pearson Theorem on asymptotical properties of the $\chi^2$ statistic. Informally, this theorem is as follows.

Suppose that $\mathbf{X}$ is a random sample with IID components $X_i$ and with PDF/PMF $f(\,\cdot\,;\theta)$ depending on a parameter $\theta \in \Theta$, where $\Theta$ is an open domain in a Euclidean space of dimension $|\Theta|$. Suppose that the MLE $\hat\theta(\mathbf{X})$ is an asymptotically normal RV as $n \to \infty$. Fix a subset $\Theta_0 \subset \Theta$ such that $\Theta_0$ is an open domain in a Euclidean space of a lesser dimension $|\Theta_0|$. State the null hypothesis $H_0$ that $\theta \in \Theta_0$. Then (as in the Pearson Theorem), under $H_0$, the RV $R$ in formula (4.5.2) has asymptotically the $\chi^2_p$ distribution with $p = |\Theta| - |\Theta_0|$ degrees of freedom. More precisely, for all $\theta \in \Theta_0$ and $h > 0$:

$$\lim_{n\to\infty} P_\theta(R > h) = \int_h^\infty f_{\chi^2_p}(x)\,dx.$$


There is also a version of the Wilks Theorem for independent, but not IID, 
RVs X1, X2,.... The theorem was named after S. S. Wilks (1906-1964), an 
American statistician who for a while worked with Fisher. We will not prove 
the Wilks Theorem but illustrate its role in several examples. 


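Example 4.5.1 A simple example is where $X_i \sim N(\mu,\sigma^2)$, with $\sigma^2$ known or unknown. Suppose first that $\sigma^2$ is known. As in the previous section, we fix a value $\mu_0$ and test $H_0$: $\mu = \mu_0$ against $H_1$: $\mu \in \mathbb{R}$ unrestricted. Then, under $H_1$, $\bar{x}$ is the MLE of $\mu$, and

$$\max\,[f(\mathbf{x};\mu):\ \mu \in \mathbb{R}] = \frac{1}{(\sqrt{2\pi}\sigma)^n}\exp\left(-\frac{S_{xx}}{2\sigma^2}\right),$$

where

$$S_{xx} = \sum_i (x_i - \bar{x})^2,$$

whereas under $H_0$,

$$f(\mathbf{x};\mu_0) = \frac{1}{(\sqrt{2\pi}\sigma)^n}\exp\left(-\frac{\Sigma^2_{\mu_0}}{2\sigma^2}\right),$$

where

$$\Sigma^2_{\mu_0} = \sum_i (x_i - \mu_0)^2.$$

Hence,

$$\Lambda_{H_1;H_0} = \exp\left[\frac{1}{2\sigma^2}\left(\Sigma^2_{\mu_0} - S_{xx}\right)\right]$$

and

$$2\ln\Lambda_{H_1;H_0} = \frac{1}{\sigma^2}\left(\Sigma^2_{\mu_0} - S_{xx}\right) = \frac{n}{\sigma^2}(\bar{x} - \mu_0)^2 \sim \chi^2_1.$$

We see that in this example, $2\ln\Lambda_{H_1;H_0}(\mathbf{X})$ has precisely a $\chi^2_1$ distribution (i.e. the Wilks Theorem is exact). According to the GLRT, we reject $H_0$ at level $\alpha$ when $2\ln\Lambda_{H_1;H_0}(\mathbf{x})$ exceeds $h^+_1(\alpha)$, the upper $\alpha$ quantile of the $\chi^2_1$ distribution. This is equivalent to rejecting $H_0$ when $|\bar{x} - \mu_0|\sqrt{n}/\sigma$ exceeds $z^+(\alpha/2)$, the upper $\alpha/2$ quantile of the $N(0,1)$ distribution.

If $\sigma^2$ is unknown, then we have to use the MLE $\hat\sigma^2$ equal to $\Sigma^2_{\mu_0}/n$ under $H_0$ and $S_{xx}/n$ under $H_1$. In this situation, $\sigma^2$ is considered as a nuisance parameter: it does not enter the null hypothesis and yet is to be taken into account. Then

$$\text{under } H_1:\quad \max\,[f(\mathbf{x};\mu,\sigma^2):\ \mu \in \mathbb{R},\ \sigma^2 > 0] = \frac{1}{(2\pi S_{xx}/n)^{n/2}}\,e^{-n/2},$$

and

$$\text{under } H_0:\quad \max\,[f(\mathbf{x};\mu_0,\sigma^2):\ \sigma^2 > 0] = \frac{1}{(2\pi\Sigma^2_{\mu_0}/n)^{n/2}}\,e^{-n/2}.$$

Hence,

$$\Lambda_{H_1;H_0} = \left(\frac{\Sigma^2_{\mu_0}}{S_{xx}}\right)^{n/2} = \left(1 + \frac{n(\bar{x} - \mu_0)^2}{S_{xx}}\right)^{n/2},$$

and we reject $H_0$ when $n(\bar{x} - \mu_0)^2/S_{xx}$ is large. But this is again precisely the t-test, and we do not need the Wilks Theorem.
We see that the t-test can be considered as an (important) example of a GLRT.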
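
The identity $\Sigma^2_{\mu_0} - S_{xx} = n(\bar{x} - \mu_0)^2$ behind the known-variance case is easy to confirm numerically. The sample below is made up purely for illustration, and the function names are hypothetical.

```python
def two_log_glr_known_variance(xs, mu0, sigma_sq):
    """2 ln(GLR) computed from the two sums of squares, as in the example."""
    mean = sum(xs) / len(xs)
    s_xx = sum((x - mean) ** 2 for x in xs)          # centred at the sample mean
    sigma_mu0_sq = sum((x - mu0) ** 2 for x in xs)   # centred at mu0
    return (sigma_mu0_sq - s_xx) / sigma_sq

def squared_z_statistic(xs, mu0, sigma_sq):
    """n (mean - mu0)^2 / sigma^2, the square of the usual z statistic."""
    n = len(xs)
    mean = sum(xs) / n
    return n * (mean - mu0) ** 2 / sigma_sq

sample = [4.8, 5.6, 5.1, 6.2, 4.9, 5.4]   # hypothetical data
lhs = two_log_glr_known_variance(sample, mu0=5.0, sigma_sq=0.25)
rhs = squared_z_statistic(sample, mu0=5.0, sigma_sq=0.25)
print(lhs, rhs)  # the two values agree up to rounding error
```
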
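Example 4.5.2 Now let $\mu$ be known and test $H_0$: $\sigma^2 = \sigma_0^2$ against $\sigma^2 > 0$ unrestricted. The MLE of $\sigma^2$ is $\Sigma^2/n$, where $\Sigma^2 = \sum_i (x_i - \mu)^2$. Hence,

$$\text{under } H_1:\quad \max\,[f(\mathbf{x};\mu,\sigma^2):\ \sigma^2 > 0] = \left(\frac{n}{2\pi\Sigma^2}\right)^{n/2} e^{-n/2},$$

and

$$\text{under } H_0:\quad f(\mathbf{x};\mu,\sigma_0^2) = \frac{1}{(2\pi\sigma_0^2)^{n/2}}\exp\left[-\frac{\Sigma^2}{2\sigma_0^2}\right],$$

with

$$\Lambda_{H_1;H_0} = \left(\frac{n\sigma_0^2}{\Sigma^2}\right)^{n/2}\exp\left[-\frac{n}{2} + \frac{\Sigma^2}{2\sigma_0^2}\right],$$

and

$$2\ln\Lambda_{H_1;H_0} = n\ln\frac{n\sigma_0^2}{\Sigma^2} - n + \frac{\Sigma^2}{\sigma_0^2} = -n\ln\frac{\Sigma^2}{n\sigma_0^2} + \frac{\Sigma^2 - n\sigma_0^2}{\sigma_0^2}.$$

We know that under $H_0$, $\Sigma^2/\sigma_0^2 \sim \chi^2_n$. By the LLN, the ratio $\Sigma^2/n$ converges under $H_0$ to $E(X_1 - \mu)^2 = \operatorname{Var} X_1 = \sigma_0^2$. Hence, for large $n$, the ratio $(\Sigma^2 - n\sigma_0^2)/(n\sigma_0^2)$ is close to 0. Then we can use the Taylor expansion of the logarithm. Thus,

$$2\ln\Lambda_{H_1;H_0} = -n\ln\left(1 + \frac{\Sigma^2 - n\sigma_0^2}{n\sigma_0^2}\right) + \frac{\Sigma^2 - n\sigma_0^2}{\sigma_0^2}$$

$$\approx -n\left[\frac{\Sigma^2 - n\sigma_0^2}{n\sigma_0^2} - \frac{1}{2}\left(\frac{\Sigma^2 - n\sigma_0^2}{n\sigma_0^2}\right)^2\right] + \frac{\Sigma^2 - n\sigma_0^2}{\sigma_0^2}$$

$$= \frac{n}{2}\left(\frac{\Sigma^2 - n\sigma_0^2}{n\sigma_0^2}\right)^2 = \left(\frac{\Sigma^2 - n\sigma_0^2}{\sqrt{2n\sigma_0^4}}\right)^2.$$

The next fact is that, by the CLT, as $n \to \infty$,

$$\frac{\Sigma^2 - n\sigma_0^2}{\sqrt{2n\sigma_0^4}} \sim N(0,1).$$

This is because RV $\Sigma^2(\mathbf{X})$ is the sum of IID RVs $(X_i - \mu)^2$, with

$$E(X_i - \mu)^2 = \sigma_0^2 \quad\text{and}\quad \operatorname{Var}(X_i - \mu)^2 = 2\sigma_0^4.$$

Hence, under $H_0$, as $n \to \infty$, the square

$$\left(\frac{\Sigma^2 - n\sigma_0^2}{\sqrt{2n\sigma_0^4}}\right)^2 \sim \chi^2_1, \quad\text{i.e.}\quad 2\ln\Lambda_{H_1;H_0} \sim \chi^2_1,$$

in agreement with the Wilks Theorem.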
We see that for $n$ large, the GLRT is to reject $H_0$ at level $\alpha$ when

$$\frac{(\Sigma^2 - n\sigma_0^2)^2}{2n\sigma_0^4} > h^+_1(\alpha).$$

Of course, in this situation there is a better way to proceed: we might use the fact that

$$\frac{\Sigma^2}{\sigma_0^2} \sim \chi^2_n.$$

So we reject $H_0$ when $\Sigma^2/\sigma_0^2 > h^+_n(\alpha/2)$ or $\Sigma^2/\sigma_0^2 < h^-_n(\alpha/2)$. Here we use the same statistic as in the GLRT, but with a different critical region.
Alternatively, we can take

$$\frac{S_{XX}}{\sigma_0^2} \sim \chi^2_{n-1},$$

where $S_{XX} = \sum_{i=1}^n (X_i - \overline{X})^2$. The normal $\chi^2$ test:

$$\text{reject } H_0 \text{ when } \frac{1}{\sigma_0^2}\,S_{xx} > h^+_{n-1}(\alpha/2) \text{ or } < h^-_{n-1}(\alpha/2)$$

is again more precise than the GLRT in this example.
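In a similar way we treat the case where $H_0$: $\sigma^2 = \sigma_0^2$, $H_1$: $\sigma^2 \neq \sigma_0^2$ and $\mu$ is unknown (i.e. is a nuisance parameter). Here $\hat\mu = \bar{x}$, $\hat\sigma^2 = S_{xx}/n$, and

$$\text{under } H_1:\quad \max\,[f(\mathbf{x};\mu,\sigma^2):\ \mu \in \mathbb{R},\ \sigma^2 > 0] = \frac{1}{(2\pi S_{xx}/n)^{n/2}}\,e^{-n/2},$$

and

$$\text{under } H_0:\quad \max\,[f(\mathbf{x};\mu,\sigma_0^2):\ \mu \in \mathbb{R}] = \frac{1}{(2\pi\sigma_0^2)^{n/2}}\exp\left[-\frac{1}{2\sigma_0^2}\,S_{xx}\right],$$

with the GLR statistic

$$\Lambda_{H_1;H_0} = \left(\frac{n}{e}\right)^{n/2}\left(\frac{\sigma_0^2}{S_{xx}}\right)^{n/2}\exp\left(\frac{S_{xx}}{2\sigma_0^2}\right).$$

We see that $\Lambda_{H_1;H_0}$ is large when $S_{xx}/\sigma_0^2$ is large or close to 0. Hence, in the GLRT paradigm, $H_0$ is rejected when $S_{XX}/\sigma_0^2$ is either small or large. But $S_{XX}/\sigma_0^2 \sim \chi^2_{n-1}$. We see that in this example the standard $\chi^2$ test and the GLRT use the same table (but operate with different critical regions).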


Another aspect of the GLRT is its connection with Pearson's $\chi^2$ test. Consider the null hypothesis $H_0$: $p_i = p_i(\theta)$, $1 \leq i \leq k$, for some parameter $\theta \in \Theta_0$, where set $\Theta_0$ has dimension (the number of independent co-ordinates) $|\Theta_0| = k_0 < k - 1$. The alternative $H_1$ is that probabilities $p_i$ are unrestricted (with the proviso that $p_i \geq 0$ and $\sum_i p_i = 1$).
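Example 4.5.3 Suppose that RVs $X_1,\ldots,X_m \sim N(\mu_X,\sigma_X^2)$ and $Y_1,\ldots,Y_n \sim N(\mu_Y,\sigma_Y^2)$, where $\mu_X$, $\sigma_X^2$, $\mu_Y$ and $\sigma_Y^2$ are all unknown. Let $H_0$ be that $\sigma_X^2 = \sigma_Y^2$ and $H_1$ that $\sigma_X^2 > \sigma_Y^2$. Derive the form of the likelihood ratio test and specify the distribution of the relevant statistic.

Here

$$L_{xy}(H_0) = \max\,[f_X(\mathbf{x};\mu_X,\sigma^2)f_Y(\mathbf{y};\mu_Y,\sigma^2):\ \mu_X \in \mathbb{R},\ \mu_Y \in \mathbb{R},\ \sigma > 0]$$

$$= \max\left[\frac{1}{(2\pi\sigma^2)^{(m+n)/2}}\exp\left(-\frac{S_{xx} + S_{yy}}{2\sigma^2}\right):\ \sigma > 0\right].$$

Note that for $g(x) = x^ae^{-bx}$ with $a, b > 0$:

$$\max_{x>0} g(x) = g\left(\frac{a}{b}\right) = \left(\frac{a}{b}\right)^a e^{-a}.$$

Hence,

$$L_{xy}(H_0) = \frac{1}{(2\pi\hat\sigma_0^2)^{(m+n)/2}}\,e^{-(m+n)/2}, \quad \hat\sigma_0^2 = \frac{S_{xx} + S_{yy}}{m+n}.$$

Similarly, under $H_1$

$$L_{xy}(H_1) = \max\,[f_X(\mathbf{x};\mu_X,\sigma_X^2)f_Y(\mathbf{y};\mu_Y,\sigma_Y^2):\ \mu_X, \mu_Y \in \mathbb{R},\ \sigma_X, \sigma_Y > 0]$$

$$= \frac{1}{(2\pi\hat\sigma_X^2)^{m/2}(2\pi\hat\sigma_Y^2)^{n/2}}\,e^{-(m+n)/2}$$

with $\hat\sigma_X^2 = S_{xx}/m$, $\hat\sigma_Y^2 = S_{yy}/n$ (provided that $\hat\sigma_X^2 > \hat\sigma_Y^2$). As a result,

$$\text{if } \frac{S_{xx}}{m} > \frac{S_{yy}}{n} \text{ then } \Lambda = \left(\frac{S_{xx} + S_{yy}}{m+n}\right)^{(m+n)/2}\left(\frac{S_{xx}}{m}\right)^{-m/2}\left(\frac{S_{yy}}{n}\right)^{-n/2},$$

and

$$\text{if } \frac{S_{xx}}{m} \leq \frac{S_{yy}}{n} \text{ then } \Lambda = 1.$$

Further,

$$\text{if } \frac{S_{xx}}{S_{yy}} > \frac{m}{n} \text{ then } 2\ln\Lambda = c + f\left(\frac{S_{xx}}{S_{yy}}\right).$$

Here

$$f(u) = (m+n)\ln(1+u) - m\ln u,$$

and $c$ is a constant. Next,

$$\frac{\partial f}{\partial u} = \frac{m+n}{1+u} - \frac{m}{u} = \frac{nu - m}{u(1+u)},$$

i.e. $f$ increases when $u > m/n$ increases. As a result, we reject $H_0$ if $S_{xx}/S_{yy}$ is large. Under $H_0$,

$$\sigma_X^2 = \sigma_Y^2 = \sigma^2, \quad\text{and}\quad \frac{S_{xx}}{\sigma^2} \sim \chi^2_{m-1},\ \frac{S_{yy}}{\sigma^2} \sim \chi^2_{n-1}, \text{ independently.}$$

Therefore,

$$\frac{S_{xx}/(m-1)}{S_{yy}/(n-1)} \sim F_{m-1,n-1}.$$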

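Example 4.5.4 There are $k+1$ probabilities $p_0,\ldots,p_k$. A null hypothesis $H_0$ is that they are of the form

$$p_i(\theta) = \binom{k}{i}\theta^i(1-\theta)^{k-i}, \quad 0 \leq i \leq k,$$

where $\theta \in (0,1)$. (Here $|\Theta_0| = 1$.) The alternative $H_1$ is that these probabilities form a $k$-dimensional variety.

The MLE under $H_0$ maximises $\sum_{i=0}^k n_i\log p_i(\theta)$, where $n_0,\ldots,n_k$ are occurrence numbers of values $0,\ldots,k$. Let $\hat\theta$ be the maximiser. An easy calculation shows that under $H_0$, $\hat\theta = \sum_{i=1}^k in_i/(kn)$. Under $H_1$ the MLE is $\hat{p}_i = n_i/n$, where $n = n_0 + \cdots + n_k$. Then the logarithm of the GLR yields

$$2\ln\Lambda = 2\ln\frac{\prod_{i=0}^k \hat{p}_i^{\,n_i}}{\prod_{i=0}^k (p_i(\hat\theta))^{n_i}} = 2\sum_i n_i\ln\frac{n_i}{np_i(\hat\theta)}.$$

The number of degrees of freedom is $k + 1 - 1 - |\Theta_0| = k - 1$. We reject $H_0$ when $2\ln\Lambda > h^+_{k-1}(\alpha)$.
Write $np_i(\hat\theta) = e_i$, $\delta_i = n_i - e_i$, $0 \leq i \leq k$, with $\sum_i \delta_i = 0$. A straightforward calculation is:

$$2\ln\Lambda = 2\sum_i n_i\ln\frac{n_i}{e_i} = 2\sum_i (e_i + \delta_i)\ln\left(1 + \frac{\delta_i}{e_i}\right) \approx 2\sum_i\left(\delta_i + \frac{\delta_i^2}{e_i} - \frac{\delta_i^2}{2e_i}\right) = \sum_i\frac{\delta_i^2}{e_i} = \sum_i\frac{(n_i - e_i)^2}{e_i}.$$

This is precisely Pearson's $\chi^2$ statistic.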
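
The quality of this approximation can be seen on a small made-up data set. In the sketch below the counts are hypothetical (chosen so that $\hat\theta = 1/2$ with $k = 3$) and the function name is an assumption; it computes both the exact $2\ln\Lambda$ and Pearson's statistic for the binomial family.

```python
import math

def binomial_fit_statistics(counts):
    """Return (theta_hat, 2 ln GLR, Pearson statistic) for counts of the values 0..k."""
    k = len(counts) - 1
    n = sum(counts)
    theta_hat = sum(i * c for i, c in enumerate(counts)) / (k * n)   # MLE under H0
    expected = [n * math.comb(k, i) * theta_hat ** i * (1 - theta_hat) ** (k - i)
                for i in range(k + 1)]
    # Cells with zero count contribute nothing to the log-likelihood ratio.
    two_log_glr = 2 * sum(c * math.log(c / e) for c, e in zip(counts, expected) if c > 0)
    pearson = sum((c - e) ** 2 / e for c, e in zip(counts, expected))
    return theta_hat, two_log_glr, pearson

theta_hat, two_log_glr, pearson = binomial_fit_statistics([20, 30, 30, 20])
print(theta_hat, round(two_log_glr, 3), round(pearson, 3))
```

For these counts the two statistics are of the same order but not identical, as expected from a second-order Taylor approximation when the deviations $\delta_i$ are not small.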


It has to be said that, generally speaking, the GLRT remains a universal 
and powerful tool for testing goodness of fit. We discuss two examples where 
it is used for testing homogeneity in non-homogeneous samples. 


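Example 4.5.5 Let $X_1,\ldots,X_n$ be independent RVs, $X_i \sim \mathrm{Po}(\lambda_i)$ with unknown mean $\lambda_i$, $i = 1,\ldots,n$. Find the form of the generalised likelihood ratio statistic for testing $H_0$: $\lambda_1 = \cdots = \lambda_n$, and show that it may be approximated by $Z = \sum_{i=1}^n (X_i - \overline{X})^2/\overline{X}$, where $\overline{X} = n^{-1}\sum_{i=1}^n X_i$. If, for $n = 7$, you find that the value of this statistic was 26.9, would you accept $H_0$?
For $X_i \sim \mathrm{Po}(\lambda_i)$,

$$f_{\mathbf{X}}(\mathbf{x};\lambda) = \prod_{i=1}^n \frac{\lambda_i^{x_i}}{x_i!}\,e^{-\lambda_i}.$$

Under $H_0$, the MLE $\hat\lambda$ for $\lambda$ is $\bar{x} = \sum_i x_i/n$, and under $H_1$, $\hat\lambda_i = x_i$. Then the GLR

$$\Lambda_{H_1;H_0}(\mathbf{x}) = \frac{f(\mathbf{x};\hat\lambda_1,\ldots,\hat\lambda_n)}{f(\mathbf{x};\hat\lambda)} = \frac{\prod_i x_i^{x_i}}{\bar{x}^{\,n\bar{x}}},$$

and

$$2\ln\Lambda_{H_1;H_0}(\mathbf{x}) = 2\sum_i x_i\ln\frac{x_i}{\bar{x}} = 2\sum_i\left[\bar{x} + (x_i - \bar{x})\right]\ln\left(1 + \frac{x_i - \bar{x}}{\bar{x}}\right) \approx \sum_i\frac{(x_i - \bar{x})^2}{\bar{x}},$$

as $\ln(1+x) \approx x - x^2/2$.
Finally, we know that for $Z \sim \chi^2_6$, $EZ = 6$ and $\operatorname{Var} Z = 12$. The value 26.9 appears too large. So we would reject $H_0$.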
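
A sketch of the two statistics follows, using an invented sample of seven counts (the question does not supply the data) and hypothetical function names. The last line expresses the observed value 26.9 in standard deviations of $\chi^2_6$ above its mean, which is the informal argument used in the solution.

```python
import math

def dispersion_statistic(xs):
    """Z = sum (x_i - mean)^2 / mean, the approximation to 2 ln GLR."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / mean

def two_log_glr(xs):
    """Exact 2 ln GLR = 2 sum x_i ln(x_i / mean); zero counts contribute 0."""
    mean = sum(xs) / len(xs)
    return 2 * sum(x * math.log(x / mean) for x in xs if x > 0)

counts = [3, 7, 4, 6, 5, 2, 8]          # hypothetical Poisson counts, n = 7
z = dispersion_statistic(counts)
g = two_log_glr(counts)
standardised = (26.9 - 6) / math.sqrt(12)   # 26.9 relative to mean 6, variance 12
print(round(z, 3), round(g, 3), round(standardised, 2))
```
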


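Example 4.5.6 Suppose you are given a collection of $np$ independent RVs organised in $n$ samples, each of length $p$:

$$\mathbf{X}^{(1)} = (X_{11},\ldots,X_{1p})$$
$$\mathbf{X}^{(2)} = (X_{21},\ldots,X_{2p})$$
$$\vdots$$
$$\mathbf{X}^{(n)} = (X_{n1},\ldots,X_{np}).$$

The RV $X_{ij}$ has a Poisson distribution with an unknown parameter $\lambda_j$, $1 \leq j \leq p$. You are required to test the hypothesis that $\lambda_1 = \cdots = \lambda_p$ against the alternative that values $\lambda_j > 0$ are unrestricted. Derive the form of the likelihood ratio test statistic. Show that it may be approximated by

$$\frac{n}{\overline{X}}\sum_{j=1}^p (\overline{X}_j - \overline{X})^2$$

with

$$\overline{X}_j = \frac{1}{n}\sum_{i=1}^n X_{ij}, \quad \overline{X} = \frac{1}{np}\sum_{i=1}^n\sum_{j=1}^p X_{ij}.$$

Explain how you would test the hypothesis for large $n$.

In this example, the likelihood

$$\prod_{i=1}^n\prod_{j=1}^p \frac{e^{-\lambda_j}\lambda_j^{x_{ij}}}{x_{ij}!}$$

is maximised under $H_0$ at $\hat\lambda = \bar{x}$, and has the maximal value proportional to $e^{-np\bar{x}}\,\bar{x}^{\,np\bar{x}}$. Under $H_1$, $\hat\lambda_j = \bar{x}_j$ and the maximal value is proportional to the product $e^{-np\bar{x}}\prod_{j=1}^p \bar{x}_j^{\,n\bar{x}_j}$. Hence,

$$\Lambda_{H_1;H_0} = \frac{1}{\bar{x}^{\,np\bar{x}}}\prod_{j=1}^p \bar{x}_j^{\,n\bar{x}_j}.$$

We reject $H_0$ when $2\ln\Lambda_{H_1;H_0}$ is large. Now, $\ln\Lambda_{H_1;H_0}$ has the form

$$\sum_{j=1}^p n\bar{x}_j\ln\bar{x}_j - np\bar{x}\ln\bar{x} = n\sum_{j=1}^p\left(\bar{x}_j\ln\bar{x}_j - \bar{x}_j\ln\bar{x}\right).$$

Under $H_0$, $n\overline{X}_j \sim \mathrm{Po}(n\lambda)$ and $np\overline{X} \sim \mathrm{Po}(np\lambda)$. By the LLN, both $\overline{X}_j$ and $\overline{X}$ converge to the constant $\lambda$, hence $(\overline{X}_j - \overline{X})/\overline{X}$ converges to 0. This means that for $n$ large enough, we can take a first-order expansion as a good approximation for the logarithm. Hence, for $n$ large, $2\ln\Lambda_{H_1;H_0}$ is

$$\approx n\sum_{j=1}^p \frac{(\overline{X}_j - \overline{X})^2}{\overline{X}}.$$

Observe that the ratios $Y_{l,n} = \sqrt{n}\,(\overline{X}_l - \overline{X})/\sqrt{\overline{X}}$ resemble the ratios $Y_{l,n}$ from definition (4.4.6). In fact, it is possible to check that as $n \to \infty$,

$$\sqrt{n}\,\frac{\overline{X}_l - \overline{X}}{\sqrt{\overline{X}}} \sim N(0,1).$$

(This is the main part of the proof of the Wilks Theorem.) We also have, as in equation (4.4.7), that the $Y_{l,n}$ satisfy a linear relation

$$\sum_{l=1}^p Y_{l,n}\sqrt{\overline{X}} = \sqrt{n}\sum_{l=1}^p (\overline{X}_l - \overline{X}) = 0.$$

Then as in the proof of the Pearson Theorem, as $n \to \infty$,

$$2\ln\Lambda_{H_1;H_0} \sim \chi^2_{p-1}.$$
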
Concluding this section, we provide an example of a typical Cambridge- 
style Mathematical Tripos question. 


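Example 4.5.7 What is meant by a generalised likelihood ratio test? Explain in detail how to perform such a test.

The GLRT is designed to test $H_0$: $\theta \in \Theta_0$ against the general alternative $H_1$: $\theta \in \Theta$, where $\Theta_0 \subset \Theta$. Here we use the GLR

$$\Lambda_{H_1;H_0}(\mathbf{x}) = \frac{\max\,[f(\mathbf{x};\theta):\ \theta \in \Theta]}{\max\,[f(\mathbf{x};\theta):\ \theta \in \Theta_0]}, \quad\text{where } \mathbf{x} = (x_1,\ldots,x_n)^{\mathrm T}.$$

The GLRT rejects $H_0$ for large values of $\Lambda_{H_1;H_0}$.
If random sample $\mathbf{X}$ has IID components $X_1,\ldots,X_n$ and $n$ is large then, under $H_0$, $2\ln\Lambda_{H_1;H_0}(\mathbf{X}) \sim \chi^2_p$, where the number of degrees of freedom $p = |\Theta| - |\Theta_0|$ (the Wilks Theorem). Therefore, given $\alpha \in (0,1)$, we reject $H_0$ in a test of size $\alpha$ if

$$2\ln\Lambda_{H_1;H_0}(\mathbf{x}) > h^+_p(\alpha).$$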


It is interesting to note that the GLRT was proposed by Neyman and E. 
Pearson in 1928. The NP Lemma, however, was proved only in 1933. 


Marrying leads to giving up at least one 
degree of freedom, statistically. 
(From the series ‘So spoke Supertutor’.) 


4.6 Contingency tables 


Statisticians do it by tables when it counts. 
(From the series ‘How they do it’.) 


A popular example of the GLRT is where you try to test independence of 
different ‘categories’ or ‘attributes’ assigned to several types of individuals 
or items. 


Example 4.6.1 An IQ test was proposed to split people approximately into three equal groups: excellent (A), good (B), and moderate (C). The following table gives the numbers of people who obtained grades A, B, C in three selected regions in Endland, Gales and Grogland.

Region           Grade A   Grade B   Grade C   Total $n_{i+}$
Endland            3009      2832      3008        8849
Gales              3047      3051      2997        9095
Grogland           2974      3038      3018        9030
Total $n_{+j}$     9030      8921      9023       26974

It is required to derive a GLRT of independence of this classification. Here, the values $e_{ij} = n_{i+}n_{+j}/n_{++}$ are

2962.35   2926.59   2960.06
3044.70   3007.95   3042.34
3022.94   2986.45   3020.60

The value of the Pearson statistic (defined in equation (4.6.3) below) is

$$4.56828 + 1.29357 + 1.68437 = 7.54622,$$

with the number of degrees of freedom $(3-1)\times(3-1) = 4$. At the 0.05 level $h^+_4(0.05) = 9.488$. Hence, there is no reason to reject the hypothesis that all groups are homogeneous at this level (let alone the 0.01 level).
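
The expected counts and the Pearson statistic for this table can be recomputed with a short sketch. The function names are hypothetical; the counts are those of the table and 9.488 is the tabulated value quoted above.

```python
def expected_counts(table):
    """e_ij = (row total) * (column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_table_statistic(table):
    """Sum over all cells of (n_ij - e_ij)^2 / e_ij."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

iq_table = [[3009, 2832, 3008],
            [3047, 3051, 2997],
            [2974, 3038, 3018]]
statistic = pearson_table_statistic(iq_table)
dof = (len(iq_table) - 1) * (len(iq_table[0]) - 1)
print(round(statistic, 3), dof, statistic > 9.488)
```
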


In general, you have a contingency table with $r$ rows and $c$ columns. The independence means that the probability $p_{ij}$ that say a type $i$ item falls into category, or receives attribute, $j$ is of the form $\alpha_i\beta_j$ where

$$\alpha_i, \beta_j \geq 0 \quad\text{and}\quad \sum_{i=1}^r \alpha_i = \sum_{j=1}^c \beta_j = 1.$$

This will be our null hypothesis $H_0$. The alternative $H_1$ corresponds to a general constraint $p_{ij} \geq 0$ and $\sum_{i=1}^r\sum_{j=1}^c p_{ij} = 1$. Here we set

$$p_{i+} = \sum_j p_{ij}, \quad p_{+j} = \sum_i p_{ij} \quad\text{and}\quad p_{++} = \sum_i p_{i+} = \sum_j p_{+j} = 1$$

(the notation comes from early literature).

The model behind Example 4.6.1 is that you have $n$ items or individuals (26974 in the example), and $n_{ij}$ of them fall in cell $(i,j)$ of the table, so that $\sum_{i,j} n_{ij} = n$. (Comparing with Example 4.2.2, we now have a model of selection with replacement which generalises Example 4.2.3.) Set

$$n_{i+} = \sum_j n_{ij}, \quad n_{+j} = \sum_i n_{ij} \quad\text{and}\quad n_{++} = \sum_i n_{i+} = \sum_j n_{+j} = n.$$
The RVs $N_{ij}$ have jointly a multinomial distribution with parameters $(n; (p_{ij}))$:

$$P(N_{ij} = n_{ij} \text{ for all } i, j) = n!\prod_{i,j}\frac{p_{ij}^{n_{ij}}}{n_{ij}!} \tag{4.6.1}$$

(the $p_{ij}$ are often called 'cell probabilities'). It is not difficult to recognise the background for the GLRT, where you have $|\Theta_1| = rc - 1$ (as the sum $p_{++} = 1$) and $|\Theta_0| = r - 1 + c - 1$, with

$$|\Theta_1| - |\Theta_0| = (r-1)(c-1).$$

Under $H_1$, the MLE for $p_{ij}$ is $\hat{p}_{ij} = n_{ij}/n$. In fact, for all $n_{ij}$ with $n_{++} = n$, the PMF (i.e. the likelihood) is

$$f_{(N_{ij})}((n_{ij});(p_{ij})) = n!\prod_{i,j}\frac{p_{ij}^{n_{ij}}}{n_{ij}!},$$

and the LL is

$$\ell((n_{ij});(p_{ij})) = \sum_{i,j} n_{ij}\ln p_{ij} + \ln(n!) - \sum_{i,j}\ln(n_{ij}!).$$

We want to maximise $\ell$ in $p_{ij}$ under the constraints

$$p_{ij} \geq 0, \quad \sum_{i,j} p_{ij} = 1.$$

Omitting the term

$$\ln(n!) - \sum_{i,j}\ln(n_{ij}!),$$

write the Lagrangian

$$\mathcal{L} = \sum_{i,j} n_{ij}\ln p_{ij} - \lambda\left(\sum_{i,j} p_{ij} - 1\right).$$

Its maximum is attained when for all $i, j$:

$$\frac{\partial}{\partial p_{ij}}\mathcal{L} = \frac{n_{ij}}{p_{ij}} - \lambda = 0,$$

i.e. when $p_{ij} = n_{ij}/\lambda$. Adjusting the constraint $p_{++} = 1$ yields $\lambda = n$.
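Under $H_0$, the MLE for $\alpha_i$ is $\hat\alpha_i = n_{i+}/n$; for $\beta_j$ the MLE is $\hat\beta_j = n_{+j}/n$. In fact, here the likelihood is

$$f_{(N_{ij})}((n_{ij});(\alpha_i),(\beta_j)) = n!\prod_{i,j}\frac{(\alpha_i\beta_j)^{n_{ij}}}{n_{ij}!} = n!\prod_i \alpha_i^{n_{i+}}\prod_j \beta_j^{n_{+j}}\Big/\prod_{i,j} n_{ij}!,$$

and the LL is

$$\ell((n_{ij});(\alpha_i),(\beta_j)) = \sum_i n_{i+}\ln\alpha_i + \sum_j n_{+j}\ln\beta_j + \ln(n!) - \sum_{i,j}\ln(n_{ij}!).$$

This is to be maximised in $\alpha_i$ and $\beta_j$ under the constraints

$$\alpha_i, \beta_j \geq 0, \quad \sum_i \alpha_i = \sum_j \beta_j = 1.$$

The Lagrangian now is

$$\mathcal{L} = \sum_i n_{i+}\ln\alpha_i + \sum_j n_{+j}\ln\beta_j - \lambda\left(\sum_i \alpha_i - 1\right) - \mu\left(\sum_j \beta_j - 1\right)$$

(with the term $[\ln(n!) - \sum_{i,j}\ln(n_{ij}!)]$ again omitted). The stationarity condition

$$\frac{\partial}{\partial\alpha_i}\mathcal{L} = \frac{\partial}{\partial\beta_j}\mathcal{L} = 0, \quad 1 \leq i \leq r,\ 1 \leq j \leq c,$$

yields that $\hat\alpha_i = n_{i+}/\lambda$ and $\hat\beta_j = n_{+j}/\mu$. Adjusting the constraints gives that $\lambda = \mu = n$.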
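Then the GLR

$$\Lambda_{H_1;H_0} = \prod_{i,j}\left(\frac{n_{ij}}{n}\right)^{n_{ij}}\Big/\left[\prod_i\left(\frac{n_{i+}}{n}\right)^{n_{i+}}\prod_j\left(\frac{n_{+j}}{n}\right)^{n_{+j}}\right],$$

and the statistic $2\ln\Lambda_{H_1;H_0}$ equals

$$2\sum_{i,j} n_{ij}\ln\frac{n_{ij}}{n} - 2\sum_i n_{i+}\ln\frac{n_{i+}}{n} - 2\sum_j n_{+j}\ln\frac{n_{+j}}{n} = 2\sum_{i,j} n_{ij}\ln\frac{n_{ij}}{e_{ij}}, \tag{4.6.2}$$

where $e_{ij} = (n_{i+}n_{+j})/n$. Writing $n_{ij} = e_{ij} + \delta_{ij}$ and expanding the logarithm, the RHS is

$$2\sum_{i,j}(e_{ij} + \delta_{ij})\ln\left(1 + \frac{\delta_{ij}}{e_{ij}}\right) = 2\sum_{i,j}(e_{ij} + \delta_{ij})\left(\frac{\delta_{ij}}{e_{ij}} - \frac{1}{2}\frac{\delta_{ij}^2}{e_{ij}^2} + \cdots\right),$$

which is

$$\approx \sum_{i,j}\frac{\delta_{ij}^2}{e_{ij}} = \sum_{i,j}\frac{(n_{ij} - e_{ij})^2}{e_{ij}},$$

by the same approximation as before.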
Therefore, we can repeat the general GLRT routine: form the statistic

$$2\ln\Lambda \approx \sum_{i=1}^r\sum_{j=1}^c \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \tag{4.6.3}$$

and reject $H_0$ at level $\alpha$ when the value of the statistic exceeds $h^+_{(r-1)(c-1)}(\alpha)$, the upper $\alpha$ point of $\chi^2_{(r-1)(c-1)}$.

Contingency tables also give rise to another model where you fix not only $n_{++}$, the total number of observations, but also some margins, for instance, all (or a part) of row sums $n_{i+}$. Then the random counts $N_{ij}$ in the $i$th row with fixed $n_{i+}$ become distributed multinomially with parameters $(n_{i+}; p_{i1},\ldots,p_{ic})$, independently of the rest. The null hypothesis here is that $p_{ij} = p_j$ does not depend on the row label $i$, for all $j = 1,\ldots,c$. The alternative is that $p_{ij}$ are unrestricted but $p_{i+} = 1$, $i = 1,\ldots,r$. This situation is referred to as testing homogeneity.


Example 4.6.2 In an experiment 150 patients were allocated to three groups of 45, 45 and 60 patients each. Two groups were given a new drug at different dosage levels and the third group received a placebo. The responses are in the following table.

              Improved   No difference   Worse
Placebo          16           20            9
Half dose        17           18           10
Full dose        26           20           14

We state $H_0$ as

the probabilities $p_{\text{Improved}}$, $p_{\text{No difference}}$ and $p_{\text{Worse}}$ are the same for all three patient groups,

and $H_1$ as

these probabilities may vary from one group to another.

Then under $H_1$ the likelihood is

$$f_{(N_{ij})}((n_{ij});(p_{ij})) = \prod_{i=1}^r \frac{n_{i+}!}{n_{i1}!\cdots n_{ic}!}\,p_{i1}^{n_{i1}}\cdots p_{ic}^{n_{ic}},$$

and the LL is

$$\ell((n_{ij});(p_{ij})) = \sum_{i,j} n_{ij}\ln p_{ij} + (\text{terms not depending on } p_{ij}).$$

Here, the MLE $\hat{p}_{ij}$ of $p_{ij}$ is $n_{ij}/n_{i+}$. Similarly, under $H_0$ the LL is:

$$\ell((n_{ij});(p_j)) = \sum_j n_{+j}\ln p_j + (\text{terms not depending on } p_j)$$

and the MLE $\hat{p}_j$ of $p_j$ equals $n_{+j}/n_{++}$. Then, as before, the GLR satisfies

$$2\ln\Lambda_{H_1;H_0} = 2\sum_{i,j} n_{ij}\ln\frac{\hat{p}_{ij}}{\hat{p}_j} = 2\sum_{i,j} n_{ij}\ln\frac{n_{ij}}{e_{ij}} \approx \sum_{i,j}\frac{(n_{ij} - e_{ij})^2}{e_{ij}},$$

with $e_{ij} = (n_{i+}n_{+j})/n_{++}$. The number of degrees of freedom is equal to $(r-1)(c-1)$, as $|\Theta_1| = r(c-1)$ ($c-1$ independent variables $p_{ij}$ in each of $r$ rows) and $|\Theta_0| = c-1$ (the variables are constant along the columns).

In the example under consideration: $r = 3$, $c = 3$, $(r-1)(c-1) = 4$ and $h^+_4(0.05) = 9.488$. The array of data and the corresponding array of expected values are shown in the tables below.

$n_{ij}$      Improved   No difference   Worse   Total
Placebo          16           20            9       45
Half dose        17           18           10       45
Full dose        26           20           14       60
Total            59           58           33      150

and

$e_{ij}$      Improved   No difference   Worse
Placebo         17.7         17.4          9.9
Half dose       17.7         17.4          9.9
Full dose       23.6         23.2         13.2

Consequently, $2\ln\Lambda_{H_1;H_0}$ is approximated by

$$\frac{(2.4)^2}{23.6} + \frac{(1.7)^2}{17.7} + \frac{(0.7)^2}{17.7} + \frac{(2.6)^2}{17.4} + \frac{(0.6)^2}{17.4} + \frac{(3.2)^2}{23.2} + \frac{(0.9)^2}{9.9} + \frac{(0.1)^2}{9.9} + \frac{(0.8)^2}{13.2} = 1.41692.$$

This value is $< 9.488$, hence insignificant at the 5 percent level.
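
For this table one can also compare the exact statistic $2\sum n_{ij}\ln(n_{ij}/e_{ij})$ with its Pearson approximation. The sketch below uses the counts of the example; the function name is hypothetical and all counts are assumed positive, so that the logarithms are defined.

```python
import math

def homogeneity_statistics(table):
    """Return (2 ln GLR, Pearson statistic) with e_ij = n_i+ * n_+j / n_++."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    two_log_glr = 0.0
    pearson = 0.0
    for row, r in zip(table, row_totals):
        for observed, c in zip(row, col_totals):
            expected = r * c / n
            two_log_glr += 2 * observed * math.log(observed / expected)
            pearson += (observed - expected) ** 2 / expected
    return two_log_glr, pearson

drug_table = [[16, 20, 9],     # placebo
              [17, 18, 10],    # half dose
              [26, 20, 14]]    # full dose
two_log_glr, pearson = homogeneity_statistics(drug_table)
print(round(two_log_glr, 3), round(pearson, 3))
```

Both values fall well below 9.488, so the conclusion does not depend on which of the two forms is used here.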


We finish this section with a short discussion of the often observed Simpson Paradox. It demonstrates that contingency tables are delicate structures and can be damaged when one pools data. However, there is nothing mystic about it. Consider the following example. A new complex treatment has been made available, to cure a potentially lethal illness. Not surprisingly, doctors decided to use it primarily in more serious cases. Consequently, the new data cover many more of these cases as opposed to the old data spread evenly across a wider range of patients. This may lead to a deceptive picture. For example, consider the following table.

                      Didn't recover | Recovered | Recovery %
Previous treatment          7500     |   5870    |   43.9
New treatment              12520     |   1450    |   10.38

You may think that the new treatment is four times worse than the old one. However, the point is that the new treatment is applied considerably more often than the old one in hospitals where serious cases are usually dealt with. On the other hand, in clinics where cases are typically less serious the new treatment is rare. The corresponding data are shown in two tables below.

Hospitals             Didn't recover | Recovered | Recovery %
Previous treatment          1100     |     70    |    5.98
New treatment              12500     |   1200    |    8.76

and

Clinics               Didn't recover | Recovered | Recovery %
Previous treatment          6400     |   5800    |   47.54
New treatment                 20     |    250    |   92.59

It is now evident that the new treatment is better in both categories.
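
The reversal can be verified directly from the counts in the three tables. The dictionary layout and function names below are illustrative choices, not part of the original example.

```python
def recovery_rate(not_recovered, recovered):
    """Percentage of patients who recovered."""
    return 100.0 * recovered / (not_recovered + recovered)

# (didn't recover, recovered) per treatment, as in the tables above.
hospitals = {"previous": (1100, 70), "new": (12500, 1200)}
clinics = {"previous": (6400, 5800), "new": (20, 250)}

def pooled(treatment):
    """Add hospital and clinic counts for one treatment."""
    h, c = hospitals[treatment], clinics[treatment]
    return h[0] + c[0], h[1] + c[1]

for name, stratum in (("hospitals", hospitals), ("clinics", clinics)):
    print(name, {t: round(recovery_rate(*counts), 2) for t, counts in stratum.items()})
print("pooled", {t: round(recovery_rate(*pooled(t)), 2) for t in ("previous", "new")})
```
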
In general, the method of contingency tables is not free of logical difficul- 
ties. Suppose you have the data in a table below, coming from 100 two-child 


families. 
Ist Child 
Boy Girl 
Boy 30 20 50 
2nd Child 
Girl 20 30 50 
50 50 ~=100 


Consider two null hypotheses:

• $H_0^1$: The probability of a boy is 1/2 and the genders of the children are
independent.
• $H_0^2$: The genders of the children are independent.

Each of these hypotheses is tested against the alternative: the probabilities
are unrestricted (which means that $p_{BB}$, $p_{BG}$, $p_{GB}$, $p_{GG}$ only obey $p_{\cdot\cdot}\geq 0$
and $p_{BB}+p_{BG}+p_{GB}+p_{GG}=1$; this includes possible dependence). For both
hypotheses the expected count in every cell is 25 (under $H_0^2$ because all
estimated marginal probabilities equal 1/2), so the value of the Pearson
statistic equals

$$\frac{(30-25)^2}{25}+\frac{(20-25)^2}{25}+\frac{(20-25)^2}{25}+\frac{(30-25)^2}{25}=4.$$

The number of degrees of freedom for $H_0^1$ equals 3, with the 5 percent
quantile 7.815. For $H_0^2$, the number of degrees of freedom is 1; the same
percentile equals 3.841.

We see that at significance level 5 percent there is no evidence to reject
$H_0^1$ but there is evidence to reject $H_0^2$, although logically $H_0^1$ implies $H_0^2$. We
thank A. Hawkes for this example (see [70]).
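
The computation behind this apparent contradiction is short enough to reproduce. The sketch below is only an illustration: the variable names are ours, and the two critical values are the 5 percent points quoted in the text rather than values computed by the code.

```python
# Observed counts: rows are the 2nd child (boy, girl), columns the 1st child.
observed = [[30, 20], [20, 30]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# First hypothesis: every cell has probability 1/4, so each expected count is n/4.
stat_simple = sum((o - n / 4) ** 2 / (n / 4) for row in observed for o in row)

# Second hypothesis: independence only; expected counts use estimated marginals.
stat_indep = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)

# 5 percent points quoted in the text (chi-square with 3 and 1 degrees of freedom).
CRIT_3DF, CRIT_1DF = 7.815, 3.841
reject_simple = stat_simple > CRIT_3DF
reject_indep = stat_indep > CRIT_1DF
print(stat_simple, reject_simple, stat_indep, reject_indep)
```

Both statistics coincide here; only the number of degrees of freedom, and hence the critical value, differs between the two tests.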


4.7 Testing normal distributions, 
2: non-homogeneous samples 


Variance is what any two statisticians are at. 
(From the series ‘Why they are misunderstood’. ) 


A typical situation here is when we have several (in the simplest case, two) 
samples coming from normal distributions with parameters that may vary 
from one sample to another. The task then is to test that the value of a 
given parameter is the same for all groups. 


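Two-sample normal distribution tests. Here we have two independent
samples,

$$\mathbf{X}=\begin{pmatrix}X_1\\ \vdots\\ X_m\end{pmatrix}\quad\text{and}\quad\mathbf{Y}=\begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix},$$

where the $X_i$ are IID N$(\mu_1,\sigma_1^2)$ and the $Y_j$ are IID N$(\mu_2,\sigma_2^2)$.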
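(a) Testing equality of means, common known variance. In this model one
assumes that $\sigma_1^2=\sigma_2^2=\sigma^2$ is known and tests $H_0$: $\mu_1=\mu_2$ against $H_1$: $\mu_1$,
$\mu_2$ unrestricted. One uses the GLRT (which works well). Under $H_1$, the
likelihood $f_{\mathbf X}(\mathbf x;\mu_1,\sigma^2)f_{\mathbf Y}(\mathbf y;\mu_2,\sigma^2)$ is

$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{m+n}\exp\left(-\frac{1}{2\sigma^2}\sum_i(x_i-\mu_1)^2-\frac{1}{2\sigma^2}\sum_j(y_j-\mu_2)^2\right)$$

$$=\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{m+n}\exp\left[-\frac{1}{2\sigma^2}\left(S_{xx}+m(\bar x-\mu_1)^2+S_{yy}+n(\bar y-\mu_2)^2\right)\right]$$

and is maximised at

$$\hat\mu_1=\bar x\ \text{ and }\ \hat\mu_2=\bar y,\quad\text{where}\quad\bar x=\frac1m\sum_i x_i,\quad\bar y=\frac1n\sum_j y_j.$$

Under $H_0$, the likelihood $f_{\mathbf X}(\mathbf x;\mu,\sigma^2)f_{\mathbf Y}(\mathbf y;\mu,\sigma^2)$ equals

$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{m+n}\exp\left[-\frac{1}{2\sigma^2}\left(S_{xx}+m(\bar x-\mu)^2+S_{yy}+n(\bar y-\mu)^2\right)\right]$$

and is maximised at

$$\hat\mu=\frac{m\bar x+n\bar y}{m+n}.$$

Then the GLR

$$\Lambda_{H_1;H_0}=\exp\left[\frac{1}{2\sigma^2}\,\frac{mn(\bar x-\bar y)^2}{m+n}\right],$$

and we reject $H_0$ when $|\bar x-\bar y|$ is large.

Under $H_0$,

$$\bar X-\bar Y\sim N\left(0,\sigma^2\left(\frac1m+\frac1n\right)\right).$$

Hence, under $H_0$ the statistic

$$Z=\frac{\bar X-\bar Y}{\sigma\sqrt{1/m+1/n}}\sim N(0,1),$$

and we reject $H_0$ at level $\alpha$ when the value of $|Z|$ is $>z_+(\alpha/2)=\Phi^{-1}(1-\alpha/2)$.
The test is called the normal z-test. Note that the Wilks Theorem is
exact here.

In what follows $z_+(\gamma)$ denotes the upper $\gamma$ point (quantile) of the normal
N(0,1) distribution, i.e. the value for which

$$\frac{1}{\sqrt{2\pi}}\int_{z_+(\gamma)}^{\infty}e^{-x^2/2}\,\mathrm dx=\gamma.$$

As was noted, it is given by $\Phi^{-1}(1-\gamma)$, $0<\gamma<1$.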

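(b) Testing equality of means, common unknown variance. Here the
assumption is that $\sigma_1^2=\sigma_2^2=\sigma^2$ is unknown: it gives a single nuisance
parameter. Again, $H_0$: $\mu_1=\mu_2$ and $H_1$: $\mu_1$, $\mu_2$ unrestricted. As we will see,
the result will be a t-test.

Under $H_1$ we have to maximise

$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{m+n}\exp\left\{-\frac{1}{2\sigma^2}\left[S_{xx}+m(\bar x-\mu_1)^2+S_{yy}+n(\bar y-\mu_2)^2\right]\right\}$$

in $\mu_1$, $\mu_2$ and $\sigma^2$. As before, $\hat\mu_1=\bar x$, $\hat\mu_2=\bar y$. The MLE for $\sigma^2$ is

$$\hat\sigma^2=\frac{S_{xx}+S_{yy}}{m+n},\quad\text{where}\quad S_{xx}=\sum_i(x_i-\bar x)^2,\quad S_{yy}=\sum_j(y_j-\bar y)^2.$$

(The corresponding calculation is similar to the one producing $\hat\sigma^2=S_{xx}/n$
in the single-sample case.) Hence, under $H_1$,

$$\max\left[f(\mathbf x;\mu_1,\sigma^2)f(\mathbf y;\mu_2,\sigma^2):\ \mu_1,\mu_2\in\mathbb R,\ \sigma^2>0\right]=\left(2\pi\,\frac{S_{xx}+S_{yy}}{m+n}\right)^{-(m+n)/2}e^{-(m+n)/2}.$$

Under $H_0$, the MLE for $\mu$ is, as before,

$$\hat\mu=\frac{m\bar x+n\bar y}{m+n},$$

and the MLE for $\sigma^2$ is

$$\hat\sigma^2=\frac{1}{m+n}\left[S_{xx}+S_{yy}+\frac{mn}{m+n}(\bar x-\bar y)^2\right].$$

Hence, under $H_0$,

$$\max\left[f(\mathbf x;\mu,\sigma^2)f(\mathbf y;\mu,\sigma^2):\ \mu\in\mathbb R,\ \sigma^2>0\right]=\left\{\frac{2\pi}{m+n}\left[S_{xx}+S_{yy}+\frac{mn}{m+n}(\bar x-\bar y)^2\right]\right\}^{-(m+n)/2}e^{-(m+n)/2}.$$

This yields the following expression for the GLR:

$$\Lambda_{H_1;H_0}=\left(\frac{(m+n)(S_{xx}+S_{yy})+mn(\bar x-\bar y)^2}{(m+n)(S_{xx}+S_{yy})}\right)^{(m+n)/2}=\left(1+\frac{mn(\bar x-\bar y)^2}{(m+n)(S_{xx}+S_{yy})}\right)^{(m+n)/2}.$$

Hence, in the GLRT, we reject $H_0$ when

$$\frac{|\bar x-\bar y|}{\sqrt{(S_{xx}+S_{yy})(1/m+1/n)}}$$

is large. It is convenient to multiply the last expression by $\sqrt{n+m-2}$. This
produces the following statistic:

$$T=\frac{\bar X-\bar Y}{\sqrt{\dfrac{S_{XX}+S_{YY}}{m+n-2}\left(\dfrac1m+\dfrac1n\right)}}\sim t_{n+m-2}.$$

In fact, under $H_0$,

$$\frac{\bar X-\bar Y}{\sigma\sqrt{1/m+1/n}}\sim N(0,1),\quad\frac{1}{\sigma^2}S_{XX}\sim\chi^2_{m-1}\quad\text{and}\quad\frac{1}{\sigma^2}S_{YY}\sim\chi^2_{n-1},$$

independently. Thus

$$\frac{S_{XX}+S_{YY}}{\sigma^2}\sim\chi^2_{m+n-2}.$$

The Wilks Theorem is of course valid in the limit $m,n\to\infty$, but is not
needed here.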


Example 4.7.1 Seeds of a particular variety of plant were randomly
assigned to either a nutritionally rich environment (the treatment) or standard
conditions (the control). After a predetermined period, all plants were
harvested, dried and weighed, with weights in grams shown in the following
table.

Control   | 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14
Treatment | 4.81 4.17 4.41 3.59 5.87 3.83 6.03 4.89 4.32 4.69

Here control observations $X_1,\dots,X_m$ are IID N$(\mu_X,\sigma^2)$, and treatment
observations $Y_1,\dots,Y_n$ are IID N$(\mu_Y,\sigma^2)$. One tests

$H_0$: $\mu_X=\mu_Y$ against $H_1$: $\mu_X$, $\mu_Y$ unrestricted.

We have

$m=n=10$, $\bar x=5.032$, $S_{xx}=3.060$, $\bar y=4.661$, $S_{yy}=5.669$.

Then

$$\hat\sigma^2=\frac{S_{xx}+S_{yy}}{m+n-2}=0.485$$

and

$$|t|=\frac{|\bar x-\bar y|}{\sqrt{\hat\sigma^2(1/m+1/n)}}=1.19.$$

With $t_{18}(0.025)=2.101$, we do not reject $H_0$ at the 5 percent level, concluding
that there is no evidence of a difference between the mean weights due to
environmental conditions.

Now suppose that the value of the variance $\sigma^2$ is known to be 0.480 and
is not estimated from the sample. Calculating

$$z=(\bar x-\bar y)\Big/\left(\sigma\sqrt{\frac1m+\frac1n}\right)$$

yields the value $z=1.1974$. Since $\Phi(z)=0.8844$, which is less than 0.975,
we still have no reason to reject $H_0$.
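
The numbers in Example 4.7.1 can be reproduced from the raw data. The following Python sketch uses only the standard library; the names `centred_ss`, `pooled_var` and the constant `T18_UPPER_0025` are our own, and the t quantile is the tabulated value quoted in the example rather than something the code computes.

```python
from math import sqrt
from statistics import NormalDist, mean

control = [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14]
treatment = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]


def centred_ss(sample):
    """Sum of squared deviations from the sample mean (S_xx in the text)."""
    centre = mean(sample)
    return sum((value - centre) ** 2 for value in sample)


m, n = len(control), len(treatment)
x_bar, y_bar = mean(control), mean(treatment)
s_xx, s_yy = centred_ss(control), centred_ss(treatment)

# (b) Unknown common variance: pooled estimate and the t statistic.
pooled_var = (s_xx + s_yy) / (m + n - 2)
t_stat = (x_bar - y_bar) / sqrt(pooled_var * (1 / m + 1 / n))
T18_UPPER_0025 = 2.101  # upper 2.5 percent point of t_18, quoted in the text
reject_t = abs(t_stat) > T18_UPPER_0025

# (a) Known variance 0.480: the normal z-test.
sigma2_known = 0.480
z_stat = (x_bar - y_bar) / sqrt(sigma2_known * (1 / m + 1 / n))
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # upper 2.5 percent point of N(0,1)
reject_z = abs(z_stat) > z_crit

print(round(pooled_var, 3), round(t_stat, 2), reject_t)
print(round(z_stat, 4), round(NormalDist().cdf(z_stat), 4), reject_z)
```

Neither test rejects the null hypothesis at the 5 percent level, in agreement with the conclusion of the example.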


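(c) Testing equality of variances, known means. First, consider the case
where $\mu_1$ and $\mu_2$ are known (and not necessarily equal). The null hypothesis
is $H_0$: $\sigma_1^2=\sigma_2^2$ and the alternative $H_1$: $\sigma_1^2$, $\sigma_2^2$ unrestricted. Under $H_0$, the
likelihood $f(\mathbf x;\mu_1,\sigma^2)f(\mathbf y;\mu_2,\sigma^2)$ is maximised at $\hat\sigma^2=(\Sigma^2_{xx}+\Sigma^2_{yy})/(m+n)$
and attains the value

$$\left[\frac{2\pi(\Sigma^2_{xx}+\Sigma^2_{yy})}{m+n}\right]^{-(m+n)/2}e^{-(m+n)/2}.$$

Here $\Sigma^2_{xx}=\sum_i(x_i-\mu_1)^2$ and $\Sigma^2_{yy}=\sum_j(y_j-\mu_2)^2$.
Under $H_1$, the likelihood $f(\mathbf x;\mu_1,\sigma_1^2)f(\mathbf y;\mu_2,\sigma_2^2)$ is maximised at $\hat\sigma_1^2=\Sigma^2_{xx}/m$,
$\hat\sigma_2^2=\Sigma^2_{yy}/n$ and attains the value

$$\left(\frac{2\pi\Sigma^2_{xx}}{m}\right)^{-m/2}\left(\frac{2\pi\Sigma^2_{yy}}{n}\right)^{-n/2}e^{-(m+n)/2}.$$

Then the GLR

$$\Lambda_{H_1;H_0}=\frac{m^{m/2}n^{n/2}}{(m+n)^{(m+n)/2}}\,\frac{(\Sigma^2_{xx}+\Sigma^2_{yy})^{(m+n)/2}}{(\Sigma^2_{xx})^{m/2}(\Sigma^2_{yy})^{n/2}}\ \propto\ \left(1+\frac{\Sigma^2_{yy}}{\Sigma^2_{xx}}\right)^{m/2}\left(1+\frac{\Sigma^2_{xx}}{\Sigma^2_{yy}}\right)^{n/2}.$$

The last expression is large when $\Sigma^2_{xx}/\Sigma^2_{yy}$ is large or small. But under $H_0$,

$$\frac{1}{\sigma^2}\Sigma^2_{XX}\sim\chi^2_m\quad\text{and}\quad\frac{1}{\sigma^2}\Sigma^2_{YY}\sim\chi^2_n,\quad\text{independently.}$$

Thus, under $H_0$, the ratio $(\Sigma^2_{XX}/m)\big/(\Sigma^2_{YY}/n)\sim F_{m,n}$. Hence, we reject $H_0$ at level
$\alpha$ when the value of $(\Sigma^2_{xx}/m)\big/(\Sigma^2_{yy}/n)$ is either greater than $\varphi^+_{m,n}(\alpha/2)$ or less than
$\varphi^-_{m,n}(\alpha/2)$. Here, and below, $\varphi^+_{m,n}(\gamma)$ denotes the upper and $\varphi^-_{m,n}(\gamma)$ the
lower $\gamma$ point (quantile) of the Fisher distribution $F_{m,n}$, i.e. the value $a$ for
which

$$\int_a^{\infty}f_{F_{m,n}}(x)\,\mathrm dx=\gamma\quad\text{or}\quad\int_0^{a}f_{F_{m,n}}(x)\,\mathrm dx=\gamma,$$

with $\varphi^-_{m,n}(\gamma)=\varphi^+_{m,n}(1-\gamma)$, $0<\gamma<1$.

Again, the strict application of the GLRT leads to a slightly different
critical region (unequally tailed test).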

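(d) Testing equality of variances, unknown means. Now assume that $\mu_1$
and $\mu_2$ are unknown nuisance parameters and test $H_0$: $\sigma_1^2=\sigma_2^2$ against
$H_1$: $\sigma_1^2$, $\sigma_2^2$ unrestricted. Under $H_0$, the likelihood $f(\mathbf x;\mu_1,\sigma^2)f(\mathbf y;\mu_2,\sigma^2)$ is
maximised at

$$\hat\mu_1=\bar x,\quad\hat\mu_2=\bar y,\quad\hat\sigma^2=\frac{1}{m+n}(S_{xx}+S_{yy}),$$

where it attains the value

$$\left(2\pi\,\frac{S_{xx}+S_{yy}}{m+n}\right)^{-(m+n)/2}e^{-(m+n)/2}.$$

Under $H_1$, the likelihood is $f(\mathbf x;\mu_1,\sigma_1^2)f(\mathbf y;\mu_2,\sigma_2^2)$. Its maximum value is
attained at

$$\hat\mu_1=\bar x,\quad\hat\mu_2=\bar y,\quad\hat\sigma_1^2=\frac1m S_{xx},\quad\hat\sigma_2^2=\frac1n S_{yy},$$

and equals

$$\left(\frac{1}{2\pi S_{xx}/m}\right)^{m/2}\left(\frac{1}{2\pi S_{yy}/n}\right)^{n/2}e^{-(m+n)/2}.$$

Then

$$\Lambda_{H_1;H_0}=\left(\frac{S_{xx}+S_{yy}}{S_{xx}}\right)^{m/2}\left(\frac{S_{xx}+S_{yy}}{S_{yy}}\right)^{n/2}\frac{m^{m/2}n^{n/2}}{(m+n)^{(m+n)/2}},$$

and we reject $H_0$ when

$$\left(1+\frac{S_{yy}}{S_{xx}}\right)^{m/2}\left(1+\frac{S_{xx}}{S_{yy}}\right)^{n/2}$$

is large.

But, as follows from Example 3.1.4 (see equation (3.1.7)),

$$\frac{S_{XX}/(m-1)}{S_{YY}/(n-1)}\sim F_{m-1,n-1}.$$

So, at level $\alpha$ we reject $H_0$ in the ‘upper tail’ test when

$$\frac{(n-1)S_{xx}}{(m-1)S_{yy}}>\varphi^+_{m-1,n-1}(\alpha)$$

and in the ‘lower tail’ test when

$$\frac{(m-1)S_{yy}}{(n-1)S_{xx}}>\varphi^+_{n-1,m-1}(\alpha).$$

These tests are determined by the upper $\alpha$ points of the corresponding
F-distributions. We also can use a two-tailed test where $H_0$ is rejected if, say,

$$\frac{(n-1)S_{xx}}{(m-1)S_{yy}}>\varphi^+_{m-1,n-1}(\alpha/2)\quad\text{or}\quad\frac{(m-1)S_{yy}}{(n-1)S_{xx}}>\varphi^+_{n-1,m-1}(\alpha/2).$$

The particular choice of the critical region can be motivated by the form of
the graph of the PDF $f_{F_{m,n}}$ for given values of $m$ and $n$. See Figure 3.3.

The F-test for equality of variances was proposed by Fisher. The statistic
$(n-1)S_{xx}/((m-1)S_{yy})$ is called the F-statistic.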


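Remark 4.7.2 We can repeat the routine for the situation where the
alternative $H_1$, instead of being $\sigma_1^2\neq\sigma_2^2$, is $\sigma_1^2\geq\sigma_2^2$ (in which case we speak of
comparison of variances). Then the GLR $\Lambda_{H_1;H_0}$ is taken to be

$$\Lambda_{H_1;H_0}=\begin{cases}\left(\dfrac{S_{xx}+S_{yy}}{S_{xx}}\right)^{m/2}\left(\dfrac{S_{xx}+S_{yy}}{S_{yy}}\right)^{n/2}\dfrac{m^{m/2}n^{n/2}}{(m+n)^{(m+n)/2}},&\text{if }\dfrac1m S_{xx}\geq\dfrac1n S_{yy},\\[2ex]1,&\text{if }\dfrac1m S_{xx}<\dfrac1n S_{yy},\end{cases}$$

and at level $\alpha$ we reject $H_0$ when

$$\frac{(n-1)S_{xx}}{(m-1)S_{yy}}>\varphi^+_{m-1,n-1}(\alpha).$$

A similar modification can be made in other cases considered above.

The F-test is useful when we have $\mathbf X$ from Exp$(\lambda)$ and $\mathbf Y$ from Exp$(\mu)$
and test $H_0$: $\lambda=\mu$. In the context of a GLRT, the F-test also appears when
we test $H_0$: $\mu_1=\dots=\mu_n$, where $X_i\sim N(\mu_i,\sigma^2)$ with the same variance $\sigma^2$.
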
Example 4.7.3 To determine the concentration of nickel in solution, one
can use an alcohol method or an aqueous method. One wants to test whether
the variability of the alcohol method is greater than that of the aqueous
method. The observed nickel concentrations (in tenths of a percent) are
shown in the following table.

Alcohol method | 4.28 4.32 4.25 4.29 4.31 4.35 4.32 4.33 4.28 4.27 4.38 4.28
Aqueous method | 4.27 4.32 4.29 4.30 4.31 4.30 4.30 4.32 4.28 4.32

The model is that the values $X_1,\dots,X_{12}$ obtained by the alcohol method
and the values $Y_1,\dots,Y_{10}$ obtained by the aqueous method are independent, and

$$X_i\sim N(\mu_X,\sigma_X^2),\quad Y_j\sim N(\mu_Y,\sigma_Y^2).$$

The null hypothesis is $H_0$: $\sigma_X^2=\sigma_Y^2$ against the alternative $H_1$: $\sigma_X^2\geq\sigma_Y^2$.
Here

$m=12$, $\bar x=4.311$, $S_{xx}=0.01189$,

and

$n=10$, $\bar y=4.301$, $S_{yy}=0.00269$.

This gives

$$\frac{S_{xx}/11}{S_{yy}/9}=3.617.$$

From the F percentage tables, $\varphi^+_{11,9}(0.05)=3.10$ and $\varphi^+_{11,9}(0.01)=5.18$.
Thus, we reject $H_0$ at the 5 percent level but accept at the 1 percent level.
This provides some evidence that $\sigma_X^2>\sigma_Y^2$ but further investigation is
needed to reach a higher degree of certainty. For instance, as $\varphi^+_{11,9}(0.025)\approx 3.92$,
$H_0$ should be accepted at the 2.5 percent level.
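
The decision at each level follows from a single ratio. The short Python sketch below starts from the summary statistics quoted in the example (not from the raw data); the dictionary `F_UPPER` holds the tabulated upper points cited above, and all names are illustrative.

```python
# Summary statistics quoted in Example 4.7.3.
m, s_xx = 12, 0.01189  # alcohol method
n, s_yy = 10, 0.00269  # aqueous method

# F-statistic: ratio of the two unbiased variance estimates.
f_ratio = (s_xx / (m - 1)) / (s_yy / (n - 1))

# Upper points of F_{11,9} as cited in the text (from tables, not computed here).
F_UPPER = {0.05: 3.10, 0.025: 3.92, 0.01: 5.18}

# One-sided test: reject equality of variances when the ratio exceeds the upper point.
decisions = {level: f_ratio > crit for level, crit in F_UPPER.items()}
print(round(f_ratio, 2), decisions)
```

With these inputs the ratio is about 3.62, so the hypothesis of equal variances is rejected at the 5 percent level only.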


Non-homogeneous normal samples 


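Here $\mathbf X$ has $X_i\sim N(\mu_i,\sigma_i^2)$, with parameters varying from one RV to another.

(a) Testing equality of means, known variances. Assuming that $\sigma_1^2,\dots,\sigma_n^2$
are known (not necessarily equal), we want to test

$H_0$: $\mu_1=\dots=\mu_n$ against $H_1$: $\mu_i\in\mathbb R$ are unrestricted.

Again use the GLRT. Under $H_1$, the likelihood is

$$f(\mathbf x;\mu_1,\dots,\mu_n;\sigma_1^2,\dots,\sigma_n^2)=\left(\frac{1}{2\pi}\right)^{n/2}\frac{1}{\prod_i\sigma_i}\exp\left[-\frac12\sum_i(x_i-\mu_i)^2/\sigma_i^2\right],$$

maximised at $\hat\mu_i=x_i$, with the maximal value $(2\pi)^{-n/2}\big/\left(\prod_i\sigma_i\right)$.
Under $H_0$, the likelihood

$$f(\mathbf x;\mu,\sigma_1^2,\dots,\sigma_n^2)=\left(\frac{1}{2\pi}\right)^{n/2}\frac{1}{\prod_i\sigma_i}\exp\left[-\frac12\sum_i(x_i-\mu)^2/\sigma_i^2\right]$$

attains the maximal value

$$\left(\frac{1}{2\pi}\right)^{n/2}\frac{1}{\prod_i\sigma_i}\exp\left[-\frac12\sum_i(x_i-\hat\mu)^2/\sigma_i^2\right]$$

at the weighted mean

$$\hat\mu=\hat\mu(\mathbf x)=\sum_i\frac{x_i}{\sigma_i^2}\Big/\sum_i\frac{1}{\sigma_i^2}.$$

Then the GLR

$$\Lambda_{H_1;H_0}=\exp\left[\frac12\sum_i(x_i-\hat\mu)^2/\sigma_i^2\right].$$

We reject $H_0$ when the sum $\sum_i(x_i-\hat\mu)^2/\sigma_i^2$ is large. More precisely,

$$2\ln\Lambda_{H_1;H_0}=\sum_i\frac{(X_i-\hat\mu(\mathbf X))^2}{\sigma_i^2}\sim\chi^2_{n-1}$$

(another case where the Wilks Theorem is exact). So, $H_0$ is rejected at level
$\alpha$ when $\sum_i(x_i-\hat\mu(\mathbf x))^2/\sigma_i^2$ exceeds $h^+_{n-1}(\alpha)$, the upper $\alpha$ point of $\chi^2_{n-1}$.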
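
As an illustration of this test, the sketch below computes the weighted mean and the statistic $\sum_i(x_i-\hat\mu)^2/\sigma_i^2$ in Python. The function name `equal_means_statistic` and the three data points are invented for the illustration; the critical value 5.991 is the standard tabulated upper 5 percent point of $\chi^2_2$.

```python
def equal_means_statistic(xs, variances):
    """Return the weighted-mean MLE under H0 and the statistic 2 ln(GLR)."""
    weights = [1.0 / v for v in variances]  # precision of each observation
    mu_hat = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    stat = sum(w * (x - mu_hat) ** 2 for w, x in zip(weights, xs))
    return mu_hat, stat


# Hypothetical observations with known, unequal variances.
xs = [1.0, 2.0, 4.0]
variances = [1.0, 4.0, 1.0]

mu_hat, stat = equal_means_statistic(xs, variances)

CHI2_2DF_UPPER_005 = 5.991  # tabulated upper 5 percent point of chi-square, 2 d.f.
reject = stat > CHI2_2DF_UPPER_005
print(mu_hat, stat, reject)
```

Observations with a large variance receive a small weight, so they pull the estimate $\hat\mu$ less and contribute less to the statistic.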

(b) Testing equality of means, unknown variances. Consider the same null
hypothesis $H_0$: $\mu_1=\dots=\mu_n$ and alternative $H_1$: $\mu_i$ unrestricted, in the case
of unknown variances (equal or not). Then, under $H_1$, we have to maximise
the likelihood in $\sigma_i^2$ as well. That is, in addition to $\hat\mu_i=x_i$, we have to set
$\hat\sigma_i=0$, which is not feasible. This is an example where the GLR routine is
not applicable.

However, let us assume that normal sample X has been divided into groups 
so that at least one of them contains more than a single RV, and it is known 
that within a given group the means are the same. In addition, let the 
variance be the same for all RVs X;. Then one uses a routine called ANOVA 
(analysis of variance) to test the null hypothesis that all means are the same. 
In a sense, this is a generalisation of the test in subsection (a) above. 

So, assume that we have $k$ groups of $n_i$ observations in group $i$, with
$n=n_1+\dots+n_k$. Set

$$X_{ij}=\mu_i+\epsilon_{ij},\quad j=1,\dots,n_i,\quad i=1,\dots,k.$$

We assume that $\mu_1,\dots,\mu_k$ are fixed unknown constants and $\epsilon_{ij}\sim N(0,\sigma^2)$,
independently. The variance $\sigma^2$ is unknown. The null hypothesis is

$$H_0:\ \mu_1=\dots=\mu_k=\mu,$$

and the alternative $H_1$ is that the $\mu_i$ are unrestricted.

Under $H_1$, the likelihood

$$\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\mu_i)^2\right]$$

attains its maximum at

$$\hat\mu_i=\bar x_{i+},\quad\hat\sigma^2=\frac1n\sum_i\sum_j(x_{ij}-\bar x_{i+})^2,\quad\text{where}\quad\bar x_{i+}=\frac{1}{n_i}\sum_j x_{ij}.$$

The maximum value equals

$$\left[2\pi\,\frac1n\sum_i\sum_j(x_{ij}-\bar x_{i+})^2\right]^{-n/2}e^{-n/2}.$$

It is convenient to denote

$$s_1=\sum_i\sum_j(x_{ij}-\bar x_{i+})^2;$$

this sum is called the within group sum of squares. If we assume that at
least one among the numbers $n_1,\dots,n_k$, say $n_i$, is $>1$, then the corresponding
term $\sum_j(X_{ij}-\bar X_{i+})^2$ is $>0$ with probability 1. Then the random value
$S_1=\sum_{i,j}(X_{ij}-\bar X_{i+})^2$ of the within group sum of squares is also $>0$ with
probability 1.

Under $H_0$, the likelihood

$$\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\mu)^2\right]$$

is maximised at

$$\hat\mu=\bar x_{++},\quad\hat\sigma^2=\frac1n\sum_{i,j}(x_{ij}-\bar x_{++})^2$$

and attains the value

$$\left[2\pi\,\frac1n\sum_{i,j}(x_{ij}-\bar x_{++})^2\right]^{-n/2}e^{-n/2},$$

where

$$\bar x_{++}=\frac1n\sum_{i,j}x_{ij}.$$

We write

$$s_0=\sum_{i,j}(x_{ij}-\bar x_{++})^2;$$

this is the total sum of squares.
The GLR is

$$\Lambda_{H_1;H_0}=\left(\frac{s_0}{s_1}\right)^{n/2}.$$

Write $s_0$ in the form

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar x_{i+}+\bar x_{i+}-\bar x_{++})^2=\sum_{i,j}(x_{ij}-\bar x_{i+})^2+\sum_i n_i(\bar x_{i+}-\bar x_{++})^2$$

(the cross-term sum vanishes as $\sum_j(x_{ij}-\bar x_{i+})=0$ for all $i=1,\dots,k$). Then
write

$$s_2=\sum_i n_i(\bar x_{i+}-\bar x_{++})^2;$$

this sum is called the between groups sum of squares.
Hence, $s_0=s_1+s_2$, and

$$\Lambda_{H_1;H_0}=\left(1+\frac{s_2}{s_1}\right)^{n/2}.$$

So, we reject $H_0$ when $s_2/s_1$ is large, or equivalently,

$$\frac{s_2/(k-1)}{s_1/(n-k)}$$

is large.

Now the analysis of the distributions of $X_{ij}$, $\bar X_{i+}$ and $\bar X_{++}$ is broken up
into three steps.

First, for all $i$, under both $H_0$ and $H_1$:

$$\frac{1}{\sigma^2}\sum_{j=1}^{n_i}(X_{ij}-\bar X_{i+})^2\sim\chi^2_{n_i-1},$$

according to the Fisher Theorem (see Section 3.5). Then, summing independent
$\chi^2$ distributed RVs:

$$\frac{S_1}{\sigma^2}=\frac{1}{\sigma^2}\sum_{i,j}(X_{ij}-\bar X_{i+})^2\sim\chi^2_{n-k}.$$

Next, for all $i$, again under both $H_0$ and $H_1$,

$$\sum_{j=1}^{n_i}(X_{ij}-\bar X_{i+})^2\quad\text{and}\quad\bar X_{i+}$$

are independent; see again the Fisher Theorem. Then

$$S_1=\sum_{i,j}(X_{ij}-\bar X_{i+})^2\quad\text{and}\quad S_2=\sum_i n_i(\bar X_{i+}-\bar X_{++})^2$$

are independent.

Finally, under $H_0$, the $X_{ij}$ are IID N$(\mu,\sigma^2)$. Then

$$\bar X_{++}\sim N\left(\mu,\frac{\sigma^2}{n}\right)\quad\text{and}\quad\frac{n}{\sigma^2}(\bar X_{++}-\mu)^2\sim\chi^2_1.$$

Also, the $\bar X_{i+}$ are independent N$(\mu,\sigma^2/n_i)$ and $n_i(\bar X_{i+}-\mu)^2/\sigma^2\sim\chi^2_1$.
Hence,

$$\sum_i\frac{n_i}{\sigma^2}(\bar X_{i+}-\mu)^2\sim\chi^2_k.$$

Moreover, $\bar X_{++}$ and $\bar X_{i+}-\bar X_{++}$ are independent, as they are jointly normal
and with Cov$(\bar X_{++},\bar X_{i+}-\bar X_{++})=0$. Writing

$$\sum_i\frac{n_i}{\sigma^2}(\bar X_{i+}-\mu)^2=\frac{1}{\sigma^2}\sum_i n_i(\bar X_{i+}-\bar X_{++})^2+\frac{n}{\sigma^2}(\bar X_{++}-\mu)^2,$$

we conclude that on the RHS,

$$\frac{1}{\sigma^2}\sum_i n_i(\bar X_{i+}-\bar X_{++})^2=\frac{S_2}{\sigma^2}\sim\chi^2_{k-1}\quad\text{and}\quad\frac{n}{\sigma^2}(\bar X_{++}-\mu)^2\sim\chi^2_1,$$

independently. On the other hand, write $S_2$ in the form

$$S_2=\sum_i n_i\left[(\bar X_{i+}-\mu_i)+(\mu_i-\bar\mu)+(\bar\mu-\bar X_{++})\right]^2,$$

where $\bar\mu=\sum_i\mu_i n_i/n$. Then a straightforward calculation shows that

$$\mathbb E S_2=(k-1)\sigma^2+\sum_i n_i(\mu_i-\bar\mu)^2.$$

We conclude that $S_2$ under $H_1$ tends to be inflated.
All in all, we see that the statistic

$$Q=\frac{S_2/(k-1)}{S_1/(n-k)}$$

is $\sim F_{k-1,n-k}$ under $H_0$ and tends to be larger under $H_1$. Therefore, we reject
$H_0$ at level $\alpha$ when the value of $Q$ is bigger than $\varphi^+_{k-1,n-k}(\alpha)$, the upper
$\alpha$ point of $F_{k-1,n-k}$. This is summarised in the following table.

Source of        | Degrees of | Sum of  | Mean square
variation        | freedom    | squares |
Between groups   | k - 1      | s_2     | s_2/(k - 1)
Within groups    | n - k      | s_1     | s_1/(n - k)
Total            | n - 1      | s_0     |

Example 4.7.4 In a psychology study of school mathematics teaching 
methods, 45 pupils were divided at random into 5 groups of 9. Groups A 
and B were taught in separate classes by the normal method, and groups 
C, D and E were taught together. Each day, every pupil from group C 
was publicly praised, every pupil from group D publicly reproved, and the 
members of group E ignored. At the end of the experiment, all the pupils 
took a test, and their results, in percentage of full marks, are shown in the 
following table. 


A (control) 34 28 48 40 48 46 32 30 48 
B (control) 42 46 26 38 26 38 40 42 32 
C (praised) 56 60 58 48 54 60 56 56 46 
D (reproved) 38 56 52 52 38 48 48 46 44 
E (ignored) 42 28 26 38 30 30 20 36 40 


Psychologists are interested in whether there are significant differences
between the groups. Values $X_{ij}$ are assumed independent, with $X_{ij}\sim$
N$(\mu_i,\sigma^2)$, $i=1,\dots,5$, $j=1,\dots,9$. The null hypothesis is $H_0$: $\mu_i\equiv\mu$,
the alternative $H_1$: $\mu_i$ unrestricted.

From the results we draw up two tables.

Group | Total | $\bar x_{i+}$ | $\sum_j(x_{ij}-\bar x_{i+})^2$
A     |  354  |  39.3  |  568
B     |  330  |  36.7  |  408
C     |  494  |  54.9  |  192.9
D     |  422  |  46.9  |  304.9
E     |  290  |  32.2  |  419.6

Source of        | Degrees of | Sums of  | MS     | F-ratio
variation        | freedom    | squares  |        |
Between groups   |  4         | 2890.67  | 722.67 | 722.67/47.33 = 15.27
Within groups    | 40         | 1893.33  |  47.33 |
Total            | 44         | 4784.00  |        |

The observed value of statistic $Q$ is 15.3. As $\varphi^+_{4,40}(0.001)=5.7$, the value
falls deep into the rejection region for $\alpha=0.001$. Hence, $H_0$ is rejected, and
the conclusion is that the way the children are psychologically treated has a
strong impact on their mathematics performance.
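
The ANOVA table for this example can be recomputed from the raw scores. The Python sketch below is one possible implementation of the decomposition $s_0=s_1+s_2$; the function name `one_way_anova` is our own choice, and the critical value is the tabulated one quoted in the text.

```python
groups = {
    "A": [34, 28, 48, 40, 48, 46, 32, 30, 48],
    "B": [42, 46, 26, 38, 26, 38, 40, 42, 32],
    "C": [56, 60, 58, 48, 54, 60, 56, 56, 46],
    "D": [38, 56, 52, 52, 38, 48, 48, 46, 44],
    "E": [42, 28, 26, 38, 30, 30, 20, 36, 40],
}


def one_way_anova(samples):
    """Return (within SS s1, between SS s2, F-ratio Q) for a list of samples."""
    n = sum(len(s) for s in samples)
    k = len(samples)
    grand_mean = sum(sum(s) for s in samples) / n
    group_means = [sum(s) / len(s) for s in samples]
    # Within group sum of squares: deviations from each group's own mean.
    s1 = sum((v - gm) ** 2 for s, gm in zip(samples, group_means) for v in s)
    # Between groups sum of squares: weighted deviations of group means.
    s2 = sum(len(s) * (gm - grand_mean) ** 2 for s, gm in zip(samples, group_means))
    q = (s2 / (k - 1)) / (s1 / (n - k))
    return s1, s2, q


s1, s2, q = one_way_anova(list(groups.values()))
F_4_40_UPPER_0001 = 5.7  # upper 0.1 percent point of F_{4,40}, quoted in the text
print(round(s1, 2), round(s2, 2), round(q, 2), q > F_4_40_UPPER_0001)
```

The computed ratio is far above the quoted critical value, which matches the rejection of $H_0$ in the example.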


Statisticians are fond of curvilinear shapes and often own as a pet a large 
South American snake called ANOCOVA 

(Analysis Of Covariance). 

(From the series ‘Why they are misunderstood ’.) 


Nowadays in any Grand Slam you will be tested by an -Ova. 
(A chat at a Wimbledon women’s final.) 

(c) Testing equality of variances, known mean. Now assume that
$\mu_1=\dots=\mu_n=\mu$ is known. Let us try to test $H_0$: $\sigma_1^2=\dots=\sigma_n^2$ against
$H_1$: $\sigma_i^2>0$ are unrestricted. Under $H_1$, the likelihood is

$$f(\mathbf x;\mu,\sigma_1^2,\dots,\sigma_n^2)=\left(\frac{1}{2\pi}\right)^{n/2}\frac{1}{\prod_i\sigma_i}\exp\left[-\frac12\sum_i(x_i-\mu)^2/\sigma_i^2\right].$$

It attains the maximal value

$$\frac{1}{(2\pi)^{n/2}\left[\prod_i(x_i-\mu)^2\right]^{1/2}}\,e^{-n/2}$$

at $\hat\sigma_i^2=(x_i-\mu)^2$. Under $H_0$, the likelihood is maximised at $\hat\sigma^2=\sum_i(x_i-\mu)^2/n$
and attains the value

$$\frac{1}{(2\pi)^{n/2}\left[\sum_i(x_i-\mu)^2/n\right]^{n/2}}\,e^{-n/2}.$$

This gives the GLR

$$\Lambda_{H_1;H_0}=\left\{\frac{\left[\sum_i(x_i-\mu)^2/n\right]^{n}}{\prod_i(x_i-\mu)^2}\right\}^{1/2}.$$

The difficulty here is in finding the distribution of the GLR under $H_0$.
Consequently, there is no effective test for the null hypothesis. However, the
problem is very important in a number of applications, in particular in modern
financial mathematics, where the variance has acquired a lot of importance
(and is considered as an essentially negative factor).

The statistician’s attitude to variation

is like that of an evangelist to sin; 

he sees it everywhere to a greater or lesser extent. 
(From the series ‘Why they are misunderstood’. ) 


4.8 Linear regression. The least squares estimators 


Ordinary Least Squares People. 
(From the series ‘Movies that never made it to the Big Screen’.) 


The next topic to discuss is linear regression. The model here is that a
random variable of interest, $Y$, is known to be of the form $Y=g(x)+\epsilon$,
where (i) $\epsilon$ is an RV of zero mean (for example, $\epsilon\sim N(0,\sigma^2)$ with a known
or unknown variance), (ii) $g(x)=g(x,\theta)$ is a function of a given type (e.g.
a linear form $\gamma+\beta x$ or an exponential $ke^{ax}$ (made linear after taking logs),
where some or all components of parameter $\theta$ are unknown) and (iii) $x$ is
a given value (or sometimes a random variable). We will mainly deal with
a multidimensional version in which $Y$ and $\epsilon$ are replaced with random
vectors

$$\mathbf Y=\begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix}\quad\text{and}\quad\boldsymbol\epsilon=\begin{pmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{pmatrix},$$

and function $g(x)$ with a vector function $\mathbf g(\mathbf x)$ of a vector argument $\mathbf x$. The
aim is to specify $g(x)$ or $\mathbf g(\mathbf x)$, i.e. to infer the values of the unknown
parameters.

An interesting example of how to use linear regression is related to the
Hubble law of linear expansion in astronomy. In 1929, E.P. Hubble
(1889–1953), an American astronomer, published an important paper reporting
that the Universe is expanding (i.e. galaxies are moving away from Earth,
and the more distant the galaxy, the greater the speed with which it is
receding). The constant of proportionality which arises in this linear dependence
was named the Hubble constant, and its calculation became one of the
central challenges in astronomy: it would allow us to assess the age of the
Universe.

To solve this problem, one uses linear regression, as the data available are
scarce (related measurements on galaxies take a long time and are often
imprecise). Since 1929, there have been several rounds of meticulous
calculations, and the Hubble constant has been repeatedly re-estimated. Every
new round of calculation so far has produced a greater age for the Universe;
it will be interesting to see whether this trend continues.

We mainly focus on a simple linear regression, in which each component
$Y_i$ of $\mathbf Y$ is determined by

$$Y_i=\gamma+\beta x_i+\epsilon_i,\quad i=1,\dots,n,\qquad(4.8.1)$$

and $\epsilon_1,\dots,\epsilon_n$ are IID, with $\mathbb E\epsilon_i=0$, Var $\epsilon_i=\sigma^2$. Here $\gamma$ and $\beta$ are unknown,
while $\mathbf x=(x_1,\dots,x_n)$ is a given vector. RVs $\epsilon_i$ are considered to represent
a ‘noise’ and are often called errors (of observation or measurement). Then,
of course, $Y_1,\dots,Y_n$ are independent, with $\mathbb EY_i=\gamma+\beta x_i$ and Var $Y_i=\sigma^2$.

A convenient reparametrisation of equation (4.8.1) is

$$Y_i=\alpha+\beta(x_i-\bar x)+\epsilon_i,\quad i=1,\dots,n,\qquad(4.8.2)$$

with

$$\bar x=\frac1n\sum_i x_i,\quad\alpha=\gamma+\beta\bar x\quad\text{and}\quad\sum_i(x_i-\bar x)=0.$$

In this model, we receive data $(x_1,y_1),\dots,(x_n,y_n)$, where the values $x_i$
are known and the values $y_i$ are realisations of RVs $Y_i$. Determining values
of $\gamma$ and $\beta$ (or equivalently, $\alpha$ and $\beta$) means drawing a regression line

$$y=\gamma+\beta x\quad(\text{or }y=\alpha+\beta(x-\bar x)),$$

that is, the best linear approximation for the data. In the case $\gamma=0$ one
deals with regression through the origin.

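A natural idea is to consider the pair of least squares estimators (LSEs)
$\hat\alpha$ and $\hat\beta$ of $\alpha$ and $\beta$, i.e. the values minimising the sum

$$\sum_i\left[y_i-\alpha-\beta(x_i-\bar x)\right]^2.$$

This sum measures the deviation of data $(x_1,y_1),\dots,(x_n,y_n)$ from the
attempted line $y=(\alpha-\beta\bar x)+\beta x$. By solving the stationarity equations

$$\frac{\partial}{\partial\alpha}\sum_i\left[y_i-\alpha-\beta(x_i-\bar x)\right]^2=0,\quad\frac{\partial}{\partial\beta}\sum_i\left[y_i-\alpha-\beta(x_i-\bar x)\right]^2=0,$$

we find that the LSEs are given by

$$\hat\alpha=\bar y,\quad\hat\beta=\frac{S_{xy}}{S_{xx}}.\qquad(4.8.3)$$

Here

$$\bar y=\frac1n\sum_i y_i,\quad S_{xx}=\sum_i(x_i-\bar x)^2,$$

$$S_{xy}=\sum_i(x_i-\bar x)y_i=\sum_i(x_i-\bar x)(y_i-\bar y),\qquad(4.8.5)$$

with $\hat\gamma=\hat\alpha-\hat\beta\bar x=\bar y-S_{xy}\bar x/S_{xx}$. We obviously have to assume that $S_{xx}\neq 0$,
i.e. not all $x_i$ are the same. Note that

$$\bar y=\hat\gamma+\hat\beta\bar x,$$

i.e. $(\bar x,\bar y)$ lies on the regression line. Furthermore, $\hat\alpha$ and $\hat\beta$ are linear functions
of $y_1,\dots,y_n$.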
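The last remark implies a number of straightforward properties of the
LSEs, which are listed in the following statements:

(i) $\mathbb E\hat\alpha=\alpha$, $\mathbb E\hat\beta=\beta$.

(ii) Var $\hat\alpha=\sigma^2/n$, Var $\hat\beta=\sigma^2/S_{xx}$.

(iii) Cov$(\hat\alpha,\hat\beta)=0$.

(iv) The LSE $(\hat\alpha,\hat\beta)$ gives the best linear unbiased estimator (BLUE) of
$(\alpha,\beta)$, i.e. the least variance unbiased estimator among those linear in
$Y_1,\dots,Y_n$.

In fact, (i)

$$\mathbb E\hat\alpha=\frac1n\sum_i\mathbb EY_i=\frac1n\sum_i\left[\alpha+\beta(x_i-\bar x)\right]=\alpha$$

and

$$\mathbb E\hat\beta=\frac{1}{S_{xx}}\sum_i(x_i-\bar x)\,\mathbb EY_i=\frac{1}{S_{xx}}\sum_i(x_i-\bar x)\left[\alpha+\beta(x_i-\bar x)\right]=\beta.$$

Next, (ii)

$$\text{Var}\,\hat\alpha=\frac{1}{n^2}\sum_i\text{Var}\,Y_i=\frac{\sigma^2}{n}$$

and

$$\text{Var}\,\hat\beta=\frac{1}{S_{xx}^2}\sum_i(x_i-\bar x)^2\,\text{Var}\,Y_i=\frac{\sigma^2}{S_{xx}}.$$

Further, (iii) as the $Y_i$ are independent,

$$\text{Cov}(\hat\alpha,\hat\beta)=\frac{1}{nS_{xx}}\sum_i(x_i-\bar x)\,\text{Var}\,Y_i=\frac{\sigma^2}{nS_{xx}}\sum_i(x_i-\bar x)=0.$$

Finally, (iv) let $A=\sum_i c_iY_i$ and $B=\sum_i d_iY_i$ be linear unbiased estimators
of $\alpha$ and $\beta$, respectively, with

$$\sum_i c_i\left[\alpha+\beta(x_i-\bar x)\right]\equiv\alpha\quad\text{and}\quad\sum_i d_i\left[\alpha+\beta(x_i-\bar x)\right]\equiv\beta.$$

Then we must have

$$\sum_i c_i=1,\quad\sum_i c_i(x_i-\bar x)=0,\quad\sum_i d_i=0,\quad\sum_i d_i(x_i-\bar x)=1.$$

Minimising Var $A=\sigma^2\sum_i c_i^2$ and Var $B=\sigma^2\sum_i d_i^2$ under the above
constraints is reduced to minimising the Lagrangians

$$\mathcal L_1(c_1,\dots,c_n;\lambda,\mu)=\sigma^2\sum_i c_i^2-\lambda\left(\sum_i c_i-1\right)-\mu\sum_i c_i(x_i-\bar x)$$

and

$$\mathcal L_2(d_1,\dots,d_n;\lambda,\mu)=\sigma^2\sum_i d_i^2-\lambda\sum_i d_i-\mu\left(\sum_i d_i(x_i-\bar x)-1\right).$$

The minimisers for both $\mathcal L_1$ and $\mathcal L_2$ are of the form

$$c_i=d_i=\frac{1}{2\sigma^2}\left[\lambda+\mu(x_i-\bar x)\right].$$

Adjusting the corresponding constraints then yields

$$c_i=\frac1n\quad\text{and}\quad d_i=\frac{x_i-\bar x}{S_{xx}},$$

i.e. $A=\hat\alpha$ and $B=\hat\beta$.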
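
To make formulas (4.8.3) and (4.8.5) concrete, here is a small Python sketch. The function name `least_squares_line` and the four data points are hypothetical and serve only as an illustration.

```python
def least_squares_line(xs, ys):
    """LSEs for Y_i = alpha + beta (x_i - x_bar) + eps_i; also returns gamma_hat."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x_i coincide, so the slope is not identifiable")
    s_xy = sum((x - x_bar) * y for x, y in zip(xs, ys))
    alpha_hat = y_bar                      # equation (4.8.3)
    beta_hat = s_xy / s_xx                 # equation (4.8.3)
    gamma_hat = alpha_hat - beta_hat * x_bar
    return alpha_hat, beta_hat, gamma_hat


# Hypothetical data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
alpha_hat, beta_hat, gamma_hat = least_squares_line(xs, ys)
print(alpha_hat, beta_hat, gamma_hat)
```

The weights $c_i=1/n$ and $d_i=(x_i-\bar x)/S_{xx}$ from the BLUE argument are exactly the coefficients with which the $y_i$ enter `alpha_hat` and `beta_hat`.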


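Example 4.8.1 (Linear regression without constraints.) Here we assume
that the reader is familiar with elements of linear algebra. Consider the
so-called general linear model $\mathbf Y=X\beta+\epsilon$, where $\mathbf Y,\epsilon\in\mathbb R^n$, $X$ is an $n\times p$
matrix and $\mathbb E\epsilon=0$. We suppose that rank$(X)=p$, $n>p$. Usually, we select
$x_{10}=x_{20}=\dots=x_{n0}=1$ in examples. Then we obtain the model

$$Y_i=\beta_0+\beta_1x_{i1}+\dots+\beta_{p-1}x_{i,p-1}+\epsilon_i,\quad i=1,2,\dots,n.$$

Let $\theta=X\beta$. The error $\epsilon^{\mathrm T}\epsilon=\|\mathbf Y-\theta\|^2$ is minimised over all $\theta\in R[X]$,
where $R[X]$ stands for the linear space generated by all the columns of the
matrix $X$. We want to find the least squares estimate (LSE) $\hat\theta$ such that the
vector $\mathbf Y-\hat\theta$ is orthogonal to $R[X]$. This means

$$X^{\mathrm T}(\mathbf Y-\hat\theta)=0.$$

This equality defines the LSE; by substituting $\hat\theta=X\hat\beta$ we have

$$X^{\mathrm T}\mathbf Y=X^{\mathrm T}X\hat\beta,\qquad(4.8.6)$$

$$\hat\beta=(X^{\mathrm T}X)^{-1}X^{\mathrm T}\mathbf Y,\qquad(4.8.7)$$

$$\hat\theta=X\hat\beta=P\mathbf Y,\quad P=X(X^{\mathrm T}X)^{-1}X^{\mathrm T}.$$

The matrix $P$ is idempotent, i.e. $P^2=P$. The LSE $\hat\beta$ is unbiased:

$$\mathbb E\hat\beta=(X^{\mathrm T}X)^{-1}X^{\mathrm T}\,\mathbb E[\mathbf Y]=(X^{\mathrm T}X)^{-1}X^{\mathrm T}X\beta=\beta.$$

We also present an alternative argument. Let $\epsilon=\mathbf Y-X\beta$. Write the
so-called residual sum of squares (RSS):

$$\epsilon^{\mathrm T}\epsilon=(\mathbf Y-X\beta)^{\mathrm T}(\mathbf Y-X\beta)=\mathbf Y^{\mathrm T}\mathbf Y-2\beta^{\mathrm T}X^{\mathrm T}\mathbf Y+\beta^{\mathrm T}X^{\mathrm T}X\beta.$$

By differentiation we get

$$X^{\mathrm T}X\beta=X^{\mathrm T}\mathbf Y.$$

We are left to check that this extremal point is a minimum. This follows
from the identity

$$(\mathbf Y-X\beta)^{\mathrm T}(\mathbf Y-X\beta)=(\mathbf Y-X\hat\beta)^{\mathrm T}(\mathbf Y-X\hat\beta)+(\hat\beta-\beta)^{\mathrm T}X^{\mathrm T}X(\hat\beta-\beta).\qquad(4.8.8)$$

Next, we use the idempotent matrix $P=X(X^{\mathrm T}X)^{-1}X^{\mathrm T}$ to proceed
with the proof of (4.8.8). Let $I$ be the unit matrix. Let us check the identity

$$(I-P)X=X-X(X^{\mathrm T}X)^{-1}X^{\mathrm T}X=X-X=0.$$

Next, $\mathbf Y-X\hat\beta=(I-P)\mathbf Y$. Hence,

$$\left(\mathbf Y-X\hat\beta+X(\hat\beta-\beta)\right)^{\mathrm T}\left(\mathbf Y-X\hat\beta+X(\hat\beta-\beta)\right)=(\mathbf Y-X\hat\beta)^{\mathrm T}(\mathbf Y-X\hat\beta)+(\hat\beta-\beta)^{\mathrm T}X^{\mathrm T}X(\hat\beta-\beta),$$

as the intermediate terms vanish. Say,

$$(\mathbf Y-X\hat\beta)^{\mathrm T}X(\hat\beta-\beta)=\mathbf Y^{\mathrm T}(I-P)^{\mathrm T}X(\hat\beta-\beta)=0$$

because $P^{\mathrm T}=P$. This implies that $\hat\beta$ is a minimiser as the remaining term
is non-negative.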
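
The normal equations (4.8.6) can be solved numerically for a small design matrix. The following Python sketch is a bare-bones illustration that avoids external libraries; the helper names (`transpose`, `solve_linear_system`, `solve_normal_equations`) and the data are assumptions made for the example, and in practice one would rely on a numerical linear algebra package.

```python
def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def solve_linear_system(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    size = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def solve_normal_equations(x, y):
    """Return beta_hat solving X^T X beta = X^T Y, cf. (4.8.6)."""
    columns = transpose(x)
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve_linear_system(xtx, xty)


# Hypothetical design: a column of ones and one explanatory variable.
x = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [2.1, 3.9, 6.2, 7.8]

beta_hat = solve_normal_equations(x, y)
fitted = [sum(b * v for b, v in zip(beta_hat, row)) for row in x]
residual = [yi - fi for yi, fi in zip(y, fitted)]
# The residual vector should be orthogonal to every column of X.
orthogonality = [sum(u * r for u, r in zip(col, residual)) for col in transpose(x)]
print(beta_hat, orthogonality)
```

The last line checks the defining property $X^{\mathrm T}(\mathbf Y-\hat\theta)=0$ up to rounding error.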


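Example 4.8.2 In fact, the linear regression is a special case of a more
general linear model. Let $Y_i=\beta_0+\beta_1x_i+\epsilon_i$, $i=1,\dots,n$, with IID $\epsilon_i$,
$\mathbb E\epsilon_i=0$. We check that the LSE estimators are

$$\hat\beta_0=\bar Y-\hat\beta_1\bar x,\quad\hat\beta_1=\frac{\sum_i(Y_i-\bar Y)(x_i-\bar x)}{\sum_i(x_i-\bar x)^2},$$

and the residual sum of squares (RSS)

$$\sum_i(Y_i-\hat Y_i)^2=(1-r^2)\sum_i(Y_i-\bar Y)^2,$$

where $\hat Y_i=\hat\beta_0+\hat\beta_1x_i$ and $r^2$ is the squared sample correlation coefficient

$$r^2=\frac{\left[\sum_i(Y_i-\bar Y)(x_i-\bar x)\right]^2}{\sum_i(Y_i-\bar Y)^2\,\sum_i(x_i-\bar x)^2}.$$

Here $X$ is the $n\times 2$ matrix with rows $(1,x_i)$, and

$$X^{\mathrm T}X=\begin{pmatrix}n&n\bar x\\ n\bar x&\sum_i x_i^2\end{pmatrix},\quad(X^{\mathrm T}X)^{-1}=\frac{1}{\sum_i(x_i-\bar x)^2}\begin{pmatrix}\frac1n\sum_i x_i^2&-\bar x\\ -\bar x&1\end{pmatrix}.$$

The formula $\hat\beta=(X^{\mathrm T}X)^{-1}X^{\mathrm T}\mathbf Y$ (see (4.8.7)) then gives the expressions for
$\hat\beta_0$ and $\hat\beta_1$ stated above.

Hence, $\hat Y_i=\bar Y+\hat\beta_1(x_i-\bar x)$, and

$$\text{RSS}=\sum_i(Y_i-\hat Y_i)^2=(1-r^2)\sum_i(Y_i-\bar Y)^2.$$

Statisticians must stay away from children’s toys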
because they regress easily. 
(From the series ‘Why they are misunderstood’.) 


Note that so far we have not used any assumption about the form of the
common distribution of errors $\epsilon_1,\dots,\epsilon_n$. If, for instance, $\epsilon_i\sim N(0,\sigma^2)$, then
the LSE pair $(\hat\alpha,\hat\beta)$ coincides with the MLE. In fact, in this case

$$Y_i\sim N(\alpha+\beta(x_i-\bar x),\sigma^2),$$

and minimising the sum

$$\sum_i\left[y_i-\alpha-\beta(x_i-\bar x)\right]^2$$

is the same as maximising the log-likelihood

$$\ell(\mathbf x,\mathbf y;\alpha,\beta,\sigma^2)=-\frac n2\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_i\left[y_i-\alpha-\beta(x_i-\bar x)\right]^2.$$

This fact explains analogies with earlier statements (like the Fisher
Theorem), which emerge when we analyse the normal linear regression in more
detail.
Indeed, assume that $\epsilon_i\sim N(0,\sigma^2)$, independently. Consider the minimal
value of the sum $\sum_i[Y_i-\alpha-\beta(x_i-\bar x)]^2$:

$$R=\sum_i\left[Y_i-\hat\alpha-\hat\beta(x_i-\bar x)\right]^2.\qquad(4.9.1)$$

$R$ is called the residual sum of squares (RSS). The following theorem holds:

(i) $\hat\alpha\sim N(\alpha,\sigma^2/n)$;
(ii) $\hat\beta\sim N(\beta,\sigma^2/S_{xx})$;
(iii) $R/\sigma^2\sim\chi^2_{n-2}$;
(iv) $\hat\alpha$, $\hat\beta$ and $R$ are independent; and
(v) $R/(n-2)$ is an unbiased estimator of $\sigma^2$.

To prove (i) and (ii), observe that $\hat\alpha$ and $\hat\beta$ are linear combinations of
independent normal RVs $Y_i$. To prove (iii) and (iv), we follow the same
strategy as in the proof of the Fisher and Pearson theorems.
Let $A$ be an $n\times n$ real orthogonal matrix whose first and second
columns are

$$\begin{pmatrix}1/\sqrt n\\ \vdots\\ 1/\sqrt n\end{pmatrix}\quad\text{and}\quad\begin{pmatrix}(x_1-\bar x)/\sqrt{S_{xx}}\\ \vdots\\ (x_n-\bar x)/\sqrt{S_{xx}}\end{pmatrix}$$

and whose remaining columns are arbitrary (chosen to maintain
orthogonality). Such a matrix always exists: the first two columns are orthonormal
(owing to $\sum_i(x_i-\bar x)=0$), and we can always complete them with $n-2$
vectors to form an orthonormal basis in $\mathbb R^n$. Then consider the random vector
$\mathbf Z=A^{\mathrm T}\mathbf Y$, with the first two entries

$$Z_1=\sqrt n\,\bar Y=\sqrt n\,\hat\alpha\sim N(\sqrt n\,\alpha,\sigma^2)$$

and

$$Z_2=\frac{1}{\sqrt{S_{xx}}}\sum_i(x_i-\bar x)Y_i=\sqrt{S_{xx}}\,\hat\beta\sim N(\sqrt{S_{xx}}\,\beta,\sigma^2).$$

In fact, owing to the orthogonality of $A$, we have that all entries $Z_1,\dots,Z_n$
are independent normal RVs with the same variance. Moreover,

$$\sum_{i=1}^{n}Z_i^2=\sum_{i=1}^{n}Y_i^2.$$

At the same time,

$$\sum_{i=3}^{n}Z_i^2=\sum_{i=1}^{n}Z_i^2-Z_1^2-Z_2^2=\sum_{i=1}^{n}Y_i^2-n\hat\alpha^2-S_{xx}\hat\beta^2$$

$$=\sum_{i=1}^{n}(Y_i-\bar Y)^2+\hat\beta^2S_{xx}-2\hat\beta S_{xY}=\sum_{i=1}^{n}\left(Y_i-\hat\alpha-\hat\beta(x_i-\bar x)\right)^2=R.$$

Hence, $\hat\alpha$, $\hat\beta$ and $R$ are independent.

If matrix $A=(A_{ij})$, then (i) for all $j\geq 2$, $\sum_iA_{ij}=0$, as columns
$2,\dots,n$ are orthogonal to column 1; (ii) for all $j\geq 3$, $\sum_iA_{ij}(x_i-\bar x)=0$,
as columns $3,\dots,n$ are orthogonal to column 2. Then, for all $j\geq 3$,

$$\mathbb EZ_j=\sum_i\mathbb EY_i\,A_{ij}=\sum_i\left(\alpha+\beta(x_i-\bar x)\right)A_{ij}=\alpha\sum_iA_{ij}+\beta\sum_i(x_i-\bar x)A_{ij}=0$$

and

$$\text{Var}\,Z_j=\sigma^2\sum_iA_{ij}^2=\sigma^2.$$

Hence,

$$Z_3,\dots,Z_n\sim N(0,\sigma^2)\ \text{independently.}$$

But then $R/\sigma^2\sim\chi^2_{n-2}$. Now (v) follows immediately, and $\mathbb ER=(n-2)\sigma^2$.

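It is useful to remember that

$$R=\sum_iY_i^2-n\hat\alpha^2-S_{xx}\hat\beta^2.\qquad(4.9.2)$$

Now we can test the null hypothesis $H_0$: $\beta=\beta_0$ against $H_1$: $\beta\in\mathbb R$
unrestricted. In fact, we have that, under $H_0$,

$$\hat\beta\sim N\left(\beta_0,\frac{\sigma^2}{S_{xx}}\right)\quad\text{and}\quad\frac{1}{\sigma^2}R\sim\chi^2_{n-2},\quad\text{independently.}$$

Then

$$T=\frac{\hat\beta-\beta_0}{\sqrt R\big/\sqrt{(n-2)S_{xx}}}\sim t_{n-2}.\qquad(4.9.3)$$

Hence, an $\alpha$ size test rejects $H_0$ when the value $t$ of $|T|$ exceeds $t_{n-2}(\alpha/2)$,
the upper $\alpha/2$ point of the $t_{n-2}$-distribution. A frequently occurring case is
$\beta_0=0$ when we test whether we need the term $\beta(x_i-\bar x)$ at all.

Similarly, we can use the statistic

$$T=\frac{\hat\alpha-\alpha_0}{\sqrt R\big/\sqrt{(n-2)n}}\sim t_{n-2}\qquad(4.9.4)$$

to test $H_0$: $\alpha=\alpha_0$ against $H_1$: $\alpha\in\mathbb R$ unrestricted.
The above tests lead to the following $100(1-\gamma)$ percent CIs for $\alpha$ and $\beta$:

$$\left(\hat\alpha-\frac{\sqrt R}{\sqrt{(n-2)n}}\,t_{n-2}(\gamma/2),\ \hat\alpha+\frac{\sqrt R}{\sqrt{(n-2)n}}\,t_{n-2}(\gamma/2)\right),\qquad(4.9.5)$$

$$\left(\hat\beta-\frac{\sqrt R}{\sqrt{(n-2)S_{xx}}}\,t_{n-2}(\gamma/2),\ \hat\beta+\frac{\sqrt R}{\sqrt{(n-2)S_{xx}}}\,t_{n-2}(\gamma/2)\right).\qquad(4.9.6)$$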

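A similar construction works for the sum $\alpha+\beta$. Here we have

$$\hat\alpha+\hat\beta\sim N\left(\alpha+\beta,\ \sigma^2\left(\frac1n+\frac{1}{S_{xx}}\right)\right),\quad\text{independently of }R.$$

Then

$$T=\frac{\hat\alpha+\hat\beta-(\alpha+\beta)}{\left[\left(\dfrac1n+\dfrac{1}{S_{xx}}\right)R/(n-2)\right]^{1/2}}\sim t_{n-2}.$$

Hence,

$$\Bigg(\hat\alpha+\hat\beta-t_{n-2}(\gamma/2)\left[\left(\frac1n+\frac{1}{S_{xx}}\right)R/(n-2)\right]^{1/2},\qquad(4.9.7)$$

$$\hat\alpha+\hat\beta+t_{n-2}(\gamma/2)\left[\left(\frac1n+\frac{1}{S_{xx}}\right)R/(n-2)\right]^{1/2}\Bigg)\qquad(4.9.8)$$

gives the $100(1-\gamma)$ percent CI for $\alpha+\beta$.

Next, we introduce a value $x$ and construct the so-called prediction interval
for a RV $Y\sim N(\alpha+\beta(x-\bar x),\sigma^2)$, independent of $Y_1,\dots,Y_n$. Here, $\bar x=\sum_i x_i/n$,
$x_1,\dots,x_n$ are the points where observations $Y_1,\dots,Y_n$ have been
made, and $x$ is considered as a new point (where no observation has so far
been made). It is natural to consider

$$\hat Y=\hat\alpha+\hat\beta(x-\bar x)$$

as a value predicted for $Y$ by our model, with

$$\mathbb E\hat Y=\alpha+\beta(x-\bar x)=\mathbb EY,\qquad(4.9.9)$$

and

$$\text{Var}(\hat Y-Y)=\text{Var}\,\hat Y+\text{Var}\,Y=\text{Var}\,\hat\alpha+(x-\bar x)^2\,\text{Var}\,\hat\beta+\sigma^2\qquad(4.9.10)$$

$$=\sigma^2\left[1+\frac1n+\frac{(x-\bar x)^2}{S_{xx}}\right].$$

Hence,

$$\hat Y-Y\sim N\left(0,\ \sigma^2\left[1+\frac1n+\frac{(x-\bar x)^2}{S_{xx}}\right]\right),$$

and we find that a $100(1-\gamma)$ percent prediction interval for $Y$ is centred at $\hat Y$
and has length

$$2\hat\sigma\sqrt{1+\frac1n+\frac{(x-\bar x)^2}{S_{xx}}}\;t_{n-2}(\gamma/2),\quad\text{where }\hat\sigma^2=R/(n-2).$$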

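Finally, we construct the CI for the linear combination $\alpha+l\beta$. Here, by a
similar argument, we obtain that

$$\left(\hat\alpha+l\hat\beta-\hat\sigma\sqrt{\frac1n+\frac{l^2}{S_{xx}}}\;t_{n-2}(\gamma/2),\ \ \hat\alpha+l\hat\beta+\hat\sigma\sqrt{\frac1n+\frac{l^2}{S_{xx}}}\;t_{n-2}(\gamma/2)\right),\qquad(4.9.11)$$

with $\hat\sigma^2=R/(n-2)$, is a $100(1-\gamma)$ percent CI for $\alpha+l\beta$.
In particular, by selecting $l=x-\bar x$, we obtain a CI for the mean value
$\mathbb EY$ for a given observation point $x$. That is,

$$\hat\alpha+\hat\beta(x-\bar x)\ \pm\ \hat\sigma\sqrt{\frac1n+\frac{(x-\bar x)^2}{S_{xx}}}\;t_{n-2}(\gamma/2).$$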


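A natural question is: how well does the estimated regression line fit the
data? A bad fit could occur because of a large value of $\sigma^2$, but $\sigma^2$ is unknown.
To make a judgement, one performs several observations $Y_{i1},\dots,Y_{im_i}$ at
each value $x_i$:

$$Y_{ij}=\alpha+\beta(x_i-\bar x)+\epsilon_{ij},\quad j=1,\dots,m_i,\ i=1,\dots,n.\qquad(4.9.12)$$

Then the null hypothesis $H_0$ that the mean value $\mathbb EY$ is linear in $x$ is tested
as follows. The average

$$\bar Y_{i+}=\frac{1}{m_i}\sum_{j=1}^{m_i}Y_{ij}$$

equals

$$\alpha+\beta(x_i-\bar x)+\bar\epsilon_{i+},\quad\text{where}\quad\bar\epsilon_{i+}\sim N\left(0,\frac{\sigma^2}{m_i}\right).$$

Under $H_0$, we measure the ‘deviation from linearity’ by the RSS

$$\sum_i m_i\left(\bar Y_{i+}-\hat\alpha-\hat\beta(x_i-\bar x)\right)^2,$$

which obeys

$$\frac{1}{\sigma^2}\sum_i m_i\left[\bar Y_{i+}-\hat\alpha-\hat\beta(x_i-\bar x)\right]^2\sim\chi^2_{n-2}.\qquad(4.9.13)$$

Next, regardless of whether $H_0$ is true, for all $i=1,\dots,n$, the sum

$$\frac{1}{\sigma^2}\sum_j(Y_{ij}-\bar Y_{i+})^2\sim\chi^2_{m_i-1},$$

independently of $\bar Y_{1+},\dots,\bar Y_{n+}$. Hence

$$\frac{1}{\sigma^2}\sum_{i,j}(Y_{ij}-\bar Y_{i+})^2\sim\chi^2_{m_1+\dots+m_n-n},\qquad(4.9.14)$$

independently of $\bar Y_{1+},\dots,\bar Y_{n+}$. Under $H_0$, the statistic

$$\frac{\sum_i m_i\left[\bar Y_{i+}-\hat\alpha-\hat\beta(x_i-\bar x)\right]^2\big/(n-2)}{\sum_{i,j}(Y_{ij}-\bar Y_{i+})^2\big/\left(\sum_i m_i-n\right)}\sim F_{n-2,\,m_1+\dots+m_n-n}.\qquad(4.9.15)$$

Then, given $\alpha\in(0,1)$, $H_0$ is rejected when the value of statistic (4.9.15) is
greater than $\varphi^+_{n-2,\,m_1+\dots+m_n-n}(\alpha)$.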

Example 4.9.1 (Linear regression with constraints.) Consider the model
of regression with the linear constraints $A\beta = c$ or, using the rows of the
matrix $A$, $a_i^{\rm T}\beta = c_i$, $i = 1,\ldots,q$. The Lagrange function takes the form
\[ L = \epsilon^{\rm T}\epsilon + (\beta^{\rm T}A^{\rm T} - c^{\rm T})\lambda. \tag{4.9.16} \]
The optimality condition $\partial L/\partial\beta = 0$ means that
\[ -2X^{\rm T}Y + 2X^{\rm T}X\beta + A^{\rm T}\lambda = 0. \tag{4.9.17} \]
By solving (4.9.17) with constraints $A\beta = c$ we obtain
\[ \hat\beta_H = \hat\beta - \tfrac12 (X^{\rm T}X)^{-1}A^{\rm T}\lambda, \qquad
-\tfrac12\lambda = \left[A(X^{\rm T}X)^{-1}A^{\rm T}\right]^{-1}(c - A\hat\beta). \]
Collecting these formulas together we get
\[ \hat\beta_H = \hat\beta + (X^{\rm T}X)^{-1}A^{\rm T}\left[A(X^{\rm T}X)^{-1}A^{\rm T}\right]^{-1}(c - A\hat\beta). \tag{4.9.18} \]
Next, we find
\[ {\rm RSS}_H - {\rm RSS} = (c - A\hat\beta)^{\rm T}\left[A(X^{\rm T}X)^{-1}A^{\rm T}\right]^{-1}(c - A\hat\beta). \]
Our goal is to check the hypothesis $H_0$, i.e. $A\beta = c$, using the Fisher ratio
\[ F = \frac{({\rm RSS}_H - {\rm RSS})/q}{{\rm RSS}/(n-p)}. \tag{4.9.19} \]
The Fisher statistic $F$ has the distribution $F_{q,n-p}$. In the case $c = 0$ it may
be written in the form
\[ F = \frac{n-p}{q}\,\frac{Y^{\rm T}(P - P_H)Y}{Y^{\rm T}(I - P)Y}. \tag{4.9.20} \]
Here the matrix $P_H$ is symmetric and idempotent, and $PP_H = P_HP = P_H$.
The matrix $P$ stands for the projector on the linear space $R[X]$ generated by
the columns of the matrix $X$, and the matrix $P_H$ stands for the projection
on the intersection $R[X] \cap \{A\beta = 0\}$.


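Example 4.9.2 Consider an observation model
\[ Y_1 = \alpha_1 + \epsilon_1, \qquad Y_2 = 2\alpha_1 - \alpha_2 + \epsilon_2, \qquad Y_3 = \alpha_1 + 2\alpha_2 + \epsilon_3, \]
where $\epsilon \sim N(0, \sigma^2 I)$. Find the Fisher statistic for checking the hypothesis
$H_0$: $\alpha_1 = \alpha_2$.
In this example
\[ \hat\alpha_1 = \tfrac16 (Y_1 + 2Y_2 + Y_3), \qquad \hat\alpha_2 = \tfrac15 (2Y_3 - Y_2), \]
\[ {\rm RSS} = Y_1^2 + Y_2^2 + Y_3^2 - 6\hat\alpha_1^2 - 5\hat\alpha_2^2. \]
Here $n = 3$, $p = 2$, $q = 1$ and
\[ A(X^{\rm T}X)^{-1}A^{\rm T} = \frac{11}{30}, \qquad A\hat\alpha = \hat\alpha_1 - \hat\alpha_2. \]
The Fisher statistic has the form
\[ F = \frac{(A\hat\alpha)^{\rm T}\left[A(X^{\rm T}X)^{-1}A^{\rm T}\right]^{-1}A\hat\alpha}{qS^2} = \frac{(\hat\alpha_1 - \hat\alpha_2)^2}{11S^2/30}, \]
where $S^2 = {\rm RSS}/(n-p) = {\rm RSS}$. Here $F \sim F_{q,n-p} = F_{1,1}$.
Alternatively, $\alpha_1 = \alpha_2 = \alpha$ under hypothesis $H_0$. So, we minimise
\[ (Y_1 - \alpha)^2 + (Y_2 - \alpha)^2 + (Y_3 - 3\alpha)^2 \]
and obtain $\hat\alpha_H = \frac{1}{11}(Y_1 + Y_2 + 3Y_3)$. Hence,
\[ {\rm RSS}_H = (Y_1 - \hat\alpha_H)^2 + (Y_2 - \hat\alpha_H)^2 + (Y_3 - 3\hat\alpha_H)^2, \]
and the Fisher statistic is $F = ({\rm RSS}_H - {\rm RSS})/{\rm RSS}$. Finally, straightforward algebraic
manipulations demonstrate that the two expressions for ${\rm RSS}_H$ are equivalent:
\[ {\rm RSS}_H = (Y_1 - \hat\alpha_H)^2 + (Y_2 - \hat\alpha_H)^2 + (Y_3 - 3\hat\alpha_H)^2
= \frac{30}{11}(\hat\alpha_1 - \hat\alpha_2)^2 + Y_1^2 + Y_2^2 + Y_3^2 - 6\hat\alpha_1^2 - 5\hat\alpha_2^2. \]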
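The equivalence of the two routes to the Fisher statistic in Example 4.9.2 can be checked numerically. The following is a minimal sketch only: the function name `fisher_example` and the sample values of $Y$ are illustrative assumptions, not part of the text.

```python
def fisher_example(y1, y2, y3):
    """Fisher statistic for H0: alpha1 = alpha2 in Example 4.9.2, computed two ways."""
    # Unconstrained least squares (the columns of X are orthogonal, X^T X = diag(6, 5)).
    a1 = (y1 + 2 * y2 + y3) / 6
    a2 = (2 * y3 - y2) / 5
    rss = y1**2 + y2**2 + y3**2 - 6 * a1**2 - 5 * a2**2
    # Constrained fit under H0: means are (alpha, alpha, 3*alpha).
    a_h = (y1 + y2 + 3 * y3) / 11
    rss_h = (y1 - a_h) ** 2 + (y2 - a_h) ** 2 + (y3 - 3 * a_h) ** 2
    # Route 1: (RSS_H - RSS)/RSS, with q = 1 and n - p = 1.
    f_direct = (rss_h - rss) / rss
    # Route 2: (a1 - a2)^2 / (11 S^2 / 30), with S^2 = RSS.
    f_formula = (a1 - a2) ** 2 / (11 * rss / 30)
    return rss, rss_h, f_direct, f_formula


# Hypothetical observations, chosen only to exercise the formulas.
rss, rss_h, f_direct, f_formula = fisher_example(1.0, 2.0, 4.0)
print(rss, rss_h, f_direct, f_formula)
```

Both routes should agree up to rounding error for any data with a non-zero RSS.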


5 


Cambridge University Mathematical 
Tripos examination questions in 
IB Statistics 


Statisticians will probably do it. 
(From the series ‘How they do it’.) 


The manipulation of statistical formulas 
is no substitute for knowing what one is doing. 
H.M. Blalock (1926-1991), American social scientist. 


Question 5.1 (a) Let $X_1,\ldots,X_n$ be IID RVs with distribution
\[ P(X = r) = p^{r-1}(1-p), \quad r = 1, 2, \ldots, \]
where the parameter $p$, $0 < p < 1$, is unknown. Find a sufficient statistic $T$
for $p$.
(b) Find an unbiased estimate for p which is a function of T. 


Solution (a) The likelihood function takes the form
\[ L(\mathbf{x}; p) = (1-p)^n \prod_{i=1}^n p^{x_i - 1} = \frac{(1-p)^n}{p^n}\,p^{x_1+\cdots+x_n}. \]
Hence, $T(\mathbf{x}) = \sum_i x_i$ is a sufficient statistic.
(b) A simple unbiased estimate of the parameter $p$ looks as follows:
\[ \hat p = 1 - \frac{1}{n}\sum_{i=1}^n I(X_i = 1). \]
Indeed,
\[ E\hat p = 1 - \frac{1}{n}\sum_{i=1}^n P(X_i = 1) = 1 - (1 - p) = p. \]
For $n = 1$ the estimate $\hat p$ is already a function of $T_1$. For $n, t = 2, 3, \ldots$,
$n \le t$, define
\[ p^* = E(\hat p \mid T_n = t) = 1 - \frac{1}{n}\sum_{i=1}^n P(X_i = 1 \mid T_n = t)
= 1 - P(X_1 = 1 \mid T_n = t) = 1 - \binom{t-2}{n-2}\Big/\binom{t-1}{n-1} \]
(with the convention $\binom{0}{0} = 1$ for $t = n$). So, the Rao-Blackwell statistic
takes the form
\[ p^* = E(I(X_1 > 1) \mid T_n) = \frac{T_n - n}{T_n - 1}. \]
In fact,
\[ P(X_1 = 1 \mid T_n = t) = \frac{P(X_1 = 1, T_n = t)}{P(T_n = t)} = \frac{P(X_1 = 1)\,P(T_{n-1} = t-1)}{P(T_n = t)}, \]
where $P(X_1 = 1) = 1 - p$,
\[ P(T_{n-1} = t-1) = \sum_{r_1+\cdots+r_{n-1} = t-1} (1-p)^{n-1}p^{t-n}, \qquad
P(T_n = t) = \sum_{r_1+\cdots+r_n = t} (1-p)^{n}p^{t-n}, \]
the sums being over positive integers $r_i$. Finally, we check that
\[ \sum_{r_1+\cdots+r_{n-1} = t-1} 1 = \binom{t-2}{n-2}, \qquad \sum_{r_1+\cdots+r_n = t} 1 = \binom{t-1}{n-1}. \]
This formula is well known in combinatorics in the following context: find
the number of placements of $s-1$ pigeons into $m-1$ nests such that no two
pigeons are placed into the same nest (see the diagram below).
Question 5.2 Derive the least squares estimators $\hat\alpha$ and $\hat\beta$ for the coefficients
of the simple linear regression model
\[ Y_i = \alpha + \beta(x_i - \bar x) + \epsilon_i, \quad i = 1,\ldots,n, \]
where $x_1,\ldots,x_n$ are given constants, $\bar x = n^{-1}\sum_{i=1}^n x_i$, and the $\epsilon_i$
are independent with
\[ E\epsilon_i = 0, \quad \mathrm{Var}(\epsilon_i) = \sigma^2, \quad i = 1,\ldots,n. \]
A manufacturer of optical equipment has the following data on the unit cost
(in pounds) of certain custom-made lenses and the number of units made in
each order.

No. of units, x_i       1    3    5   10   12
Cost per unit, y_i     58   55   40   37   22

Assuming that the conditions underlying simple linear regression analysis
are met, estimate the regression coefficients and use the estimated regression
equation to predict the unit cost in an order for 8 of these lenses.

Hint: for the data above, $S_{xy} = \sum_{i=1}^n (x_i - \bar x)y_i = -257.4$.


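Solution The LSE $\hat\alpha$ and $\hat\beta$ minimise
\[ S = \sum_i \left(y_i - \alpha - \beta(x_i - \bar x)\right)^2 \]
and thus solve the equations
\[ \sum_i y_i - n\alpha = 0, \qquad
\beta\sum_i (x_i - \bar x)^2 = \sum_i (x_i - \bar x)y_i = \sum_i (x_i - \bar x)(y_i - \bar y); \]
in other words, $\hat\alpha = \bar y$, $\hat\beta = S_{xy}/S_{xx}$ (this stationary point is obviously a
minimum). We compute
\[ \bar y = \tfrac15(58 + 55 + 40 + 37 + 22) = \tfrac{212}{5} = 42.4, \qquad
\bar x = \tfrac15(1 + 3 + 5 + 10 + 12) = 6.2, \]
\[ S_{xx} = 5.2^2 + 3.2^2 + 1.2^2 + 3.8^2 + 5.8^2 = 86.8; \]
as a result, $\hat\alpha = \bar y = 42.4$, $\hat\beta = S_{xy}/S_{xx} = -257.4/86.8 \approx -2.965$, and we get $\hat y(x) =
42.4 - 2.965(x - 6.2)$. For $x = 8$, we obtain $\hat y(8) \approx 37.06$.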
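The arithmetic in this solution is easy to reproduce. Below is a minimal sketch in plain Python; the variable names are illustrative choices, and only the data come from the question.

```python
# Lens data from Question 5.2: number of units per order and cost per unit.
xs = [1, 3, 5, 10, 12]
ys = [58, 55, 40, 37, 22]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# S_xy and S_xx as defined in the solution (centred x values).
s_xy = sum((x - x_bar) * y for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

alpha_hat = y_bar          # LSE of alpha in the centred model
beta_hat = s_xy / s_xx     # LSE of beta


def predict(x):
    """Predicted unit cost for an order of x lenses."""
    return alpha_hat + beta_hat * (x - x_bar)


print(x_bar, y_bar, s_xy, s_xx, beta_hat, predict(8))
```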
Question 5.3 Suppose that six observations $X_1,\ldots,X_6$ are selected at
random from a normal distribution for which both the mean $\mu_X$ and the
variance $\sigma_X^2$ are unknown, but it is found that $S_{XX} = \sum_{i=1}^6 (x_i - \bar x)^2 = 30$,
where $\bar x = \frac16\sum_{i=1}^6 x_i$. Suppose also that 21 observations $Y_1,\ldots,Y_{21}$ are
selected at random from another distribution for which both the mean $\mu_Y$
and the variance $\sigma_Y^2$ are unknown, and it is found that $S_{YY} = 40$. Derive
carefully the likelihood ratio test of the hypothesis $H_0$: $\sigma_X^2 = \sigma_Y^2$ against
$H_1$: $\sigma_X^2 > \sigma_Y^2$ and apply it to the data above at the 0.05 level.
Hint:
Distribution       $\chi^2_5$   $\chi^2_6$   $\chi^2_{20}$   $\chi^2_{21}$   $F_{5,20}$   $F_{6,21}$
95th percentile    11.07    12.59    31.41    32.68    2.71    2.57


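Solution $X_1,\ldots,X_m \sim N(\mu_X, \sigma_X^2)$, $Y_1,\ldots,Y_n \sim N(\mu_Y, \sigma_Y^2)$. Under $H_0$,
$\sigma_X = \sigma_Y$ and thus
\[ L_{xy}(H_0) = \sup_{\sigma_X = \sigma_Y,\ \mu_X,\ \mu_Y} f_X(\mathbf{x} \mid \mu_X, \sigma_X^2)\,f_Y(\mathbf{y} \mid \mu_Y, \sigma_Y^2)
= \sup_{\sigma} \frac{1}{(2\pi\sigma^2)^{(n+m)/2}}\exp\left(-\frac{S_{xx} + S_{yy}}{2\sigma^2}\right). \]
Now observe that for $g(x) = x^a e^{-bx}$ $(a, b > 0)$ we have
\[ \sup_{x > 0} g(x) = g(a/b) = \left(\frac{a}{b}\right)^a e^{-a}. \]
Hence,
\[ L_{xy}(H_0) = \frac{1}{(2\pi\hat\sigma_0^2)^{(m+n)/2}}\,e^{-(m+n)/2}, \qquad \hat\sigma_0^2 = \frac{S_{xx} + S_{yy}}{m+n}. \]
Similarly, under $H_1$,
\[ L_{xy}(H_1) = \sup_{\sigma_X > \sigma_Y,\ \mu_X,\ \mu_Y} f_X(\mathbf{x} \mid \mu_X, \sigma_X^2)\,f_Y(\mathbf{y} \mid \mu_Y, \sigma_Y^2)
= \frac{1}{(2\pi\hat\sigma_X^2)^{m/2}(2\pi\hat\sigma_Y^2)^{n/2}}\,e^{-(m+n)/2}, \]
with $\hat\sigma_X^2 = S_{xx}/m$, $\hat\sigma_Y^2 = S_{yy}/n$, provided only $\hat\sigma_X^2 > \hat\sigma_Y^2$. As a result,
\[ \Lambda = \begin{cases} \left(\dfrac{S_{xx} + S_{yy}}{m+n}\right)^{(m+n)/2}\left(\dfrac{m}{S_{xx}}\right)^{m/2}\left(\dfrac{n}{S_{yy}}\right)^{n/2}, & \text{if } \dfrac{S_{xx}}{m} > \dfrac{S_{yy}}{n}, \\[2mm] 1, & \text{otherwise;} \end{cases} \]
for $S_{xx}/m > S_{yy}/n$, we have
\[ 2\log\Lambda = \mathrm{Const} + g\left(\frac{S_{xx}}{S_{yy}}\right) \]
with
\[ g(z) = (m+n)\log(1+z) - m\log z; \]
then $g'(z) = \frac{m+n}{1+z} - \frac{m}{z} = \frac{nz - m}{z(1+z)}$, i.e. $g$ increases for $z > m/n$, and we reject $H_0$ if $S_{xx}/S_{yy}$ is large.
Under $H_0$, $\sigma_X^2 = \sigma_Y^2 = \sigma^2$, and $S_{xx}/\sigma^2 \sim \chi^2_{m-1}$, $S_{yy}/\sigma^2 \sim \chi^2_{n-1}$, independently;
therefore,
\[ \frac{S_{xx}/(m-1)}{S_{yy}/(n-1)} \sim F_{m-1,n-1}. \]
For the data above, the LHS equals
\[ \frac{30/5}{40/20} = 3 > 2.71 \]
(from $F_{5,20}$) and thus $H_0$ has to be rejected at this level.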
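As a quick check of the final step, the sketch below computes the variance ratio for the data of Question 5.3 and compares it with the tabulated 95th percentile from the hint. The variable names are illustrative assumptions; the critical value 2.71 is taken from the question's table rather than computed.

```python
# Summary data from Question 5.3.
m, s_xx = 6, 30.0    # six X observations, sum of squared deviations 30
n, s_yy = 21, 40.0   # twenty-one Y observations, sum of squared deviations 40

# Under H0 the ratio of unbiased variance estimates has the F(m-1, n-1) distribution.
f_stat = (s_xx / (m - 1)) / (s_yy / (n - 1))

f_crit = 2.71        # 95th percentile of F(5, 20), as quoted in the hint
reject_h0 = f_stat > f_crit

print(f_stat, reject_h0)
```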


Question 5.4 Let $X_1,\ldots,X_n$ be a random sample from the $N(\theta, \sigma^2)$ distribution,
and suppose that the prior distribution for $\theta$ is $N(\mu, \tau^2)$, where $\sigma^2$, $\mu$, $\tau^2$ are
known. Determine the posterior distribution for $\theta$, given $X_1,\ldots,X_n$, and
the best point estimate of $\theta$ under both quadratic and absolute error loss.


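Solution We have
\[ f_{\mathbf X}(\mathbf{X} \mid \theta) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \theta)^2\right), \qquad
\pi(\theta) \propto \exp\left(-\frac{(\theta - \mu)^2}{2\tau^2}\right), \]
and thus
\[ \pi(\theta \mid \mathbf{X}) \propto f_{\mathbf X}(\mathbf{X} \mid \theta)\,\pi(\theta)
\propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \theta)^2 - \frac{(\theta - \mu)^2}{2\tau^2}\right)
\propto \exp\left[-\frac{\sigma^2 + n\tau^2}{2\sigma^2\tau^2}\left(\theta - \frac{\mu\sigma^2 + \tau^2\sum_i x_i}{\sigma^2 + n\tau^2}\right)^2\right]. \]
We deduce that
\[ \pi(\theta \mid \mathbf{X}) \sim N\left(\frac{\mu\sigma^2 + \tau^2\sum_i x_i}{\sigma^2 + n\tau^2},\ \frac{\sigma^2\tau^2}{\sigma^2 + n\tau^2}\right). \]
By symmetry of the posterior distribution, the best point estimate of $\theta$ in both cases reads
\[ \hat\theta = \frac{\mu\sigma^2 + \tau^2\sum_i x_i}{\sigma^2 + n\tau^2}. \]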
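The posterior parameters derived above are simple to evaluate. Here is a minimal sketch; the function name `normal_posterior` and the sample values are hypothetical, chosen only to illustrate the formula.

```python
def normal_posterior(xs, sigma2, mu, tau2):
    """Posterior mean and variance of theta for a N(theta, sigma2) sample
    with a N(mu, tau2) prior, following the formula in the solution."""
    n = len(xs)
    denom = sigma2 + n * tau2
    mean = (mu * sigma2 + tau2 * sum(xs)) / denom
    var = sigma2 * tau2 / denom
    return mean, var


# Hypothetical data: three observations, unit variances, prior centred at 0.
post_mean, post_var = normal_posterior([1.0, 2.0, 3.0], sigma2=1.0, mu=0.0, tau2=1.0)
print(post_mean, post_var)
```

The posterior mean serves as the point estimate under both loss functions, since the mean and median of a normal distribution coincide.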
Question 5.5 An examination was given to 500 high-school students in
each of two large cities, and their grades were recorded as low, medium, or
high. The results are given in the table.

           Low   Medium   High
City A     103    145     252
City B     140    136     224

Derive carefully the test of homogeneity and test the hypothesis that the
distributions of scores among students in the two cities are the same.
Hint:
Distribution       $\chi^2_1$   $\chi^2_2$   $\chi^2_3$   $\chi^2_5$   $\chi^2_6$
99th percentile     6.63    9.21   11.34   15.09   16.81
95th percentile     3.84    5.99    7.81   11.07   12.59


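Solution

$n_{11}$   $n_{12}$   $n_{13}$   $n_{1+}$
$n_{21}$   $n_{22}$   $n_{23}$   $n_{2+}$
$n_{+1}$   $n_{+2}$   $n_{+3}$   $n_{++}$

$p_{11}$   $p_{12}$   $p_{13}$
$p_{21}$   $p_{22}$   $p_{23}$

Under $H_0$ we have: $(N_{i1}, N_{i2}, N_{i3}) \sim \mathrm{Mult}(n_{i+}, p_1, p_2, p_3)$, where $p_j = p_{1j} = p_{2j}$, $j = 1, 2, 3$, $p_+ = 1$. Next,
\[ \log \mathrm{Lik} = \mathrm{Const} + \sum_j n_{+j}\log p_j
\ \Rightarrow\ \hat p_j = \frac{n_{+j}}{n_{++}}, \qquad |\Theta_0| = 3 - 1 = 2. \]
Under $H_1$ we have $(N_{11}, N_{12}, N_{13}, N_{21}, N_{22}, N_{23}) \sim \mathrm{Mult}(n_{++}, p_{11},\ldots,p_{23})$,
$p_{++} = 1$. Next,
\[ 2\log\Lambda = 2\sum_{i,j} n_{ij}\log\frac{n_{ij}}{e_{ij}}
= 2\sum_{i,j}(e_{ij} + n_{ij} - e_{ij})\log\left(1 + \frac{n_{ij} - e_{ij}}{e_{ij}}\right)
\approx \sum_{i,j}\frac{(n_{ij} - e_{ij})^2}{e_{ij}}, \]
where $e_{ij} = n_{i+}n_{+j}/n_{++}$; as $|\Theta_1| - |\Theta_0| = 2$, we compare $2\log\Lambda$ to $\chi^2_2$.
Application:

$n_{+j}$              243     281     476     (total 1000)
$e_{ij}$              121.5   140.5   238
$n_{1j} - e_{1j}$     -18.5   4.5     14

\[ 2\log\Lambda \approx 2\left(\frac{(18.5)^2}{121.5} + \frac{(4.5)^2}{140.5} + \frac{(14)^2}{238}\right) \approx 7.569. \]
It is significant at the 95 percent level and not significant at the 99 percent level:
$2\log\Lambda = 7.569 < \chi^2_2(0.99) = 9.21$ and $> \chi^2_2(0.95) = 5.99$.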
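The homogeneity statistic can be recomputed directly from the table of counts. The sketch below evaluates both the Pearson approximation used in the solution and the exact likelihood ratio statistic $2\sum n_{ij}\log(n_{ij}/e_{ij})$; all variable names are illustrative, and the critical values are the tabulated $\chi^2_2$ percentiles from the hint.

```python
import math

# Observed counts from Question 5.5 (rows: City A, City B; columns: Low, Medium, High).
counts = [[103, 145, 252],
          [140, 136, 224]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

# Expected counts under homogeneity: e_ij = n_i+ * n_+j / n_++.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

cells = [(n, e) for obs_row, exp_row in zip(counts, expected)
         for n, e in zip(obs_row, exp_row)]

pearson = sum((n - e) ** 2 / e for n, e in cells)
lr_stat = 2 * sum(n * math.log(n / e) for n, e in cells)

chi2_2_95, chi2_2_99 = 5.99, 9.21   # tabulated percentiles of chi-square with 2 df
print(pearson, lr_stat, pearson > chi2_2_95, pearson > chi2_2_99)
```

The two statistics are close for these data, which is why the quadratic approximation is adequate here.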


Question 5.6 The following table contains a distribution obtained in 320 
tosses of 6 coins and the corresponding expected frequencies calculated with 
the formula for the binomial distribution for p = 0.5 and n = 6. 


No. of heads 0 1 2 3 4 5 6 
Observed frequencies 3 21 85 110 62 32 7 
Expected frequencies 5 30 75 100 75 30 5 


Conduct a goodness-of-fit test at the 0.05 level for the null hypothesis that 
the coins are all fair. 
Hint:
Distribution       $\chi^2_5$   $\chi^2_6$   $\chi^2_7$
95th percentile    11.07   12.59   14.07


Solution Goodness-of-fit test

heads     O     E    O-E
  0       3     5    -2
  1      21    30    -9
  2      85    75    10
  3     110   100    10
  4      62    75   -13
  5      32    30     2
  6       7     5     2

Statistic
\[ T = \sum_i \frac{(o_i - e_i)^2}{e_i}
= \frac{4}{5} + \frac{81}{30} + \frac{100}{75} + \frac{100}{100} + \frac{169}{75} + \frac{4}{30} + \frac{4}{5}
= \frac{1}{150}(240 + 425 + 538 + 150) = 9.02. \]
We have $|\Theta_1| = 6$, $|\Theta_0| = 0$, $\chi^2_6 = 12.59$ at the 0.05 level, so $T < \chi^2_6$, and there
is no evidence to reject $H_0$.
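The value $T = 9.02$ can be reproduced with a few lines of Python. This is a minimal sketch; the names are illustrative, and the critical value is the tabulated 95th percentile of $\chi^2_6$ from the hint.

```python
# Observed and expected frequencies from Question 5.6 (0 to 6 heads in 320 tosses of 6 coins).
observed = [3, 21, 85, 110, 62, 32, 7]
expected = [5, 30, 75, 100, 75, 30, 5]

# Pearson goodness-of-fit statistic.
t_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2_6_95 = 12.59    # 95th percentile of chi-square with 6 degrees of freedom
reject_h0 = t_stat > chi2_6_95

print(t_stat, reject_h0)
```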


Question 5.7 State and prove the Rao-Blackwell Theorem.

Suppose that $X_1,\ldots,X_n$ are independent RVs uniformly distributed over
$(\theta, 3\theta)$. Find a two-dimensional sufficient statistic $T(\mathbf{X})$ for $\theta$. Show that
an unbiased estimator of $\theta$ is $\hat\theta = X_1/2$.

Find an unbiased estimator of $\theta$ which is a function of $T(\mathbf{X})$ and whose
mean square error is no more than that of $\hat\theta$.


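Solution RB Theorem. Let $T$ be a sufficient statistic for $\theta$ and let $\hat\theta$ be
an estimator for $\theta$ with $E\hat\theta^2 < \infty$ for all $\theta$. Define
\[ \tilde\theta = E[\hat\theta \mid T]. \]
Then for all $\theta$,
\[ E[(\tilde\theta - \theta)^2] \le E[(\hat\theta - \theta)^2]. \]
The inequality is strict unless $\hat\theta$ is a function of $T$.
Proof. By the conditional expectation formula,
\[ E[\tilde\theta] = E[E(\hat\theta \mid T)] = E\hat\theta, \]
so $\tilde\theta$ and $\hat\theta$ have the same bias. By the conditional variance formula,
\[ \mathrm{Var}(\hat\theta) = E[\mathrm{Var}(\hat\theta \mid T)] + \mathrm{Var}[E(\hat\theta \mid T)]
= E[\mathrm{Var}(\hat\theta \mid T)] + \mathrm{Var}(\tilde\theta). \]
Hence $\mathrm{Var}(\hat\theta) \ge \mathrm{Var}(\tilde\theta)$, with equality only if $\mathrm{Var}(\hat\theta \mid T) = 0$.

The likelihood function is
\[ f(\mathbf{x} \mid \theta) = \prod_{i=1}^n \frac{1}{2\theta}\,I(\theta < x_i < 3\theta) = \frac{1}{(2\theta)^n}\,I\left(\min_i x_i > \theta,\ \max_i x_i < 3\theta\right), \]
and hence, by the factorisation criterion,
\[ T = \left(\min_i X_i,\ \max_i X_i\right) \]
is sufficient. Clearly, $EX_1 = 2\theta$, so $\frac12 X_1$ is an unbiased estimator. Define
\[ \tilde\theta = E\left[\tfrac12 X_1 \,\Big|\, \min_i X_i = a,\ \max_i X_i = b\right]
= \frac12\left[\frac{a}{n} + \frac{b}{n} + \frac{n-2}{n}\,\frac{a+b}{2}\right] = \frac{a+b}{4}. \]
Consequently, by the RB Theorem,
\[ \tilde\theta = \tfrac14\left(\min_i X_i + \max_i X_i\right) \]
is the required estimator.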
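The improvement promised by the RB Theorem can be seen in a small simulation. The sketch below is illustrative only: the function name `simulate_mse`, the choice $\theta = 1$, $n = 5$, the number of trials and the fixed seed are all assumptions made for the demonstration, not part of the question.

```python
import random


def simulate_mse(theta=1.0, n=5, trials=20000, seed=0):
    """Monte Carlo mean square errors of X_1/2 and (min + max)/4
    for a sample of size n from the uniform distribution on (theta, 3*theta)."""
    rng = random.Random(seed)
    se_crude = 0.0
    se_rb = 0.0
    for _ in range(trials):
        xs = [rng.uniform(theta, 3 * theta) for _ in range(n)]
        crude = xs[0] / 2                 # the initial unbiased estimator
        rb = (min(xs) + max(xs)) / 4      # its Rao-Blackwellised version
        se_crude += (crude - theta) ** 2
        se_rb += (rb - theta) ** 2
    return se_crude / trials, se_rb / trials


mse_crude, mse_rb = simulate_mse()
print(mse_crude, mse_rb)
```

For $\theta = 1$ the crude estimator has variance $1/12$, while the conditioned estimator has a much smaller mean square error, in line with the theorem.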


Question 5.8 The fast-food chain McGonagles have three sizes of their 
takeaway haggis, Large, Jumbo and Soopersize. A survey of 300 randomly 
selected customers at one branch choose 92 Large, 89 Jumbo and 119 Soop- 
ersize haggises. 

Is there sufficient evidence to reject the hypothesis that all three sizes are 
equally popular? Explain your answer carefully. 


Distribution       $t_1$    $t_2$    $t_3$    $\chi^2_1$   $\chi^2_2$   $\chi^2_3$   $F_{1,2}$   $F_{2,3}$
95th percentile    6.31   2.92   2.35   3.84   5.99   7.82   18.51   9.55


Solution This is a standard chi-square test:

            L     J     S
observed    92    89    119
expected    100   100   100

with the likelihood
\[ \frac{n!}{\prod_i n_i!}\prod_i p_i^{n_i}. \]
The null hypothesis is $H_0$: $p_1 = p_2 = p_3 = 1/3$, and the alternative is $H_1$:
$p_i \ge 0$, $i = 1, 2, 3$, with $p_1 + p_2 + p_3 = 1$. The chi-square statistic takes the
value
\[ \frac{8 \times 8 + 11 \times 11 + 19 \times 19}{100} = 5.46. \]
Under $H_0$ the distribution of the statistic is, approximately, $\chi^2_2$, and the value
5.46 is below the 95 percent point, 5.99. Therefore, there is not enough evidence
to reject $H_0$.


Question 5.9 In the context of hypothesis testing define the following 
terms: (i) simple hypothesis; (ii) critical region; (iii) size; (iv) power; and (v) 
type II error probability. 
State, without proof, the Neyman-Pearson Lemma.
Let $X$ be a single observation from a probability density function $f$. It is
desired to test the hypothesis
\[ H_0:\ f = f_0 \quad\text{against}\quad H_1:\ f = f_1, \]
with $f_0(x) = \frac12|x|e^{-x^2/2}$ and $f_1(x) = \Phi'(x)$, $-\infty < x < \infty$, where $\Phi(x)$ is
the distribution function of the standard normal, N(0,1).

Determine the best test of size a, where 0 < a < 1, and express its power 
in terms of ® and a. 

Find the size of the test that minimises the sum of the error probabilities. 


Solution (i) A simple hypothesis $H_0$ determines the underlying data distribution
uniquely. A simple alternative $H_1$ is the one that specifies, in
the case $H_0$ is not true, the data distribution uniquely. Otherwise, one
speaks of a composite alternative where the data distribution belongs to a
parametrised family.

(ii) The critical region of a test gives the set of data values where $H_0$ is
rejected. In fact, a (non-randomised) test is identified with its critical region.

(iii) The size of a test is the probability of the type I error, i.e. the probability
of rejecting the null hypothesis when it is true, i.e. the probability of
the critical region calculated under the distribution $P_{H_0}$.

(iv) The power of a test is the probability of rejecting the null hypothesis
when it is not true, i.e. when an alternative holds. In other words, the power
equals the probability of the critical region calculated under an alternative
distribution. In general, this is a function of a parameter that specifies the
alternative data distribution. In the case of a simple hypothesis $H_0$ versus a
simple alternative $H_1$, the power is just a number and gives the probability
of accepting $H_1$ under the distribution $P_{H_1}$.

(v) The type II error probability equals one minus the power. That is, it
gives the probability of accepting $H_0$ when $H_0$ is not true.

The Neyman-Pearson Lemma addresses the case of a simple hypothesis
against a simple alternative. Here the null hypothesis $H_0$ is that the data
PMF $f$ coincides with a given PMF $f_0$, while the alternative $H_1$ is that
$f = f_1$, another given PMF. We want to define a best test (the one with
a maximal power) among all tests of size $\le \alpha$ where $\alpha \in (0,1)$ is a given
number. The NP Lemma makes a significant step towards this goal: it states
that if we take a test with the critical region $C$ of the form
\[ C = \left\{x:\ \frac{f_1(x)}{f_0(x)} > c\right\}, \]
then it will be 'most powerful' among all tests of size $\le \alpha$ where $\alpha$ is taken
to equal $P_0(C)$. Unfortunately, it is not guaranteed that in general the test
of this form exists for any given $\alpha \in (0,1)$. (To guarantee this, one has to
consider randomised tests.) However, if $f_0$ and $f_1$ are two continuous PDFs
(on a line $\mathbb{R}$ or a Euclidean space $\mathbb{R}^n$) then for all given $\alpha \in (0,1)$ there
exists a test with a critical region of the above form with $\alpha = P_0(C)$.

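In the question, the ratio of the PDFs is
\[ \frac{f_1(x)}{f_0(x)} = \frac{e^{-x^2/2}/\sqrt{2\pi}}{|x|e^{-x^2/2}/2} = \frac{2}{\sqrt{2\pi}\,|x|}. \]
That is, the bound $f_1(x)/f_0(x) > c$ is equivalent to $|x| < a$ where $a = \dfrac{2}{\sqrt{2\pi}\,c}$.
To determine the value of $a$ for a given $\alpha \in (0,1)$, we equate $\alpha$ and $P_0(C)$.
That is, we write
\[ \alpha = P_0(|X| < a) = 2\cdot\frac12\int_0^a y\,e^{-y^2/2}\,dy = 1 - e^{-a^2/2}, \]
whence $a$ as a function of $\alpha$ is given by
\[ a(\alpha) = \sqrt{2\ln\left(\frac{1}{1-\alpha}\right)}. \]
This specifies the most powerful test of size $\alpha$. The power $\pi(\alpha)$ of this test
equals
\[ P_1(|X| < a) = \Phi(a) - \Phi(-a) = 2\Phi\left(\sqrt{2\ln\left(\frac{1}{1-\alpha}\right)}\right) - 1. \]
Accordingly, the type II error probability is $1 - \pi(\alpha)$.
To minimise the sum $\alpha + 1 - \pi(\alpha)$, we differentiate in $\alpha$ and equate to 0:
\[ \frac{d}{d\alpha}\left(\alpha + 1 - \pi(\alpha)\right) = 1 - \frac{2}{\sqrt{2\pi}}\,e^{-a(\alpha)^2/2}\,a'(\alpha) = 0. \]
With $a'(\alpha) = \dfrac{1}{a(\alpha)(1-\alpha)}$ and $1 - \alpha = e^{-a(\alpha)^2/2}$ this gives $a = 2/\sqrt{2\pi}$, which yields the value
\[ \alpha = 1 - e^{-1/\pi}. \]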
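The closed-form answer $\alpha = 1 - e^{-1/\pi}$ can be sanity-checked by a grid search over the size. The following sketch uses only the formulas derived above; the helper names (`phi`, `threshold`, `power`, `error_sum`) and the grid resolution are illustrative assumptions.

```python
import math


def phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def threshold(alpha):
    """Boundary a(alpha) of the critical region |x| < a for a size-alpha test."""
    return math.sqrt(2.0 * math.log(1.0 / (1.0 - alpha)))


def power(alpha):
    """Power of the most powerful test of size alpha."""
    return 2.0 * phi(threshold(alpha)) - 1.0


def error_sum(alpha):
    """Sum of type I and type II error probabilities."""
    return alpha + 1.0 - power(alpha)


alpha_star = 1.0 - math.exp(-1.0 / math.pi)      # closed-form minimiser
grid = [i / 1000 for i in range(1, 1000)]
alpha_grid = min(grid, key=error_sum)            # numerical minimiser on a grid

print(alpha_star, alpha_grid, error_sum(alpha_star))
```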


Question 5.10 Let $X_1,\ldots,X_n$ be a random sample from a probability
density function $f(x \mid \theta)$, where $\theta$ is an unknown real-valued parameter
which is assumed to have a prior density $\pi(\theta)$. Determine the optimal Bayes
point estimate $a(X_1,\ldots,X_n)$ of $\theta$, in terms of the posterior distribution of
$\theta$ given $X_1,\ldots,X_n$, when the loss function is
\[ L(\theta, a) = \begin{cases} \gamma(\theta - a) & \text{when } \theta > a, \\ \delta(a - \theta) & \text{when } \theta \le a, \end{cases} \]
where $\gamma$ and $\delta$ are given positive constants.
Calculate the estimate explicitly in the case when $f(x \mid \theta)$ is the density
of the uniform distribution on $(0, \theta)$ and $\pi(\theta) = e^{-\theta}\theta^n/n!$, $\theta > 0$.


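Solution The Bayes estimate is the value $a^*(\mathbf{x})$ minimising, in $a$ for a
given $\mathbf{x}$, the posterior expected loss
\[ R(a) = \gamma\int_a^{\infty}(\theta - a)\,\pi(\theta \mid \mathbf{x})\,d\theta + \delta\int_{-\infty}^{a}(a - \theta)\,\pi(\theta \mid \mathbf{x})\,d\theta. \]
As usual, we find the zero of the derivative:
\[ \frac{\partial R}{\partial a} = -\gamma\int_a^{+\infty}\pi(\theta \mid \mathbf{x})\,d\theta + \delta\int_{-\infty}^{a}\pi(\theta \mid \mathbf{x})\,d\theta = 0, \]
which yields
\[ \int_{-\infty}^{a^*}\pi(\theta \mid \mathbf{x})\,d\theta = \frac{\gamma}{\gamma + \delta}. \]
In other words, $a^*(\mathbf{x})$ gives the $\dfrac{100\gamma}{\gamma + \delta}$ percentile of the posterior distribution.

In the example, the prior is $\pi(\theta) = e^{-\theta}\theta^n/n!$, and the components $X_i$ of
$\mathbf{X}$ are IID and $\sim U(0, \theta)$, with the PDF $f(x \mid \theta) = \frac{1}{\theta}I(0 < x < \theta)$. For the
posterior it gives
\[ \pi(\theta \mid \mathbf{x}) \propto \pi(\theta)\prod_{1 \le i \le n} f(x_i \mid \theta)
\propto e^{-\theta}\theta^n\prod_{1 \le i \le n}\frac{1}{\theta}I(0 < x_i < \theta) \propto I\left(\max_i x_i < \theta\right)e^{-\theta}, \]
i.e.
\[ \pi(\theta \mid \mathbf{x}) = I\left(\theta > \max_i x_i\right)e^{\max_i x_i - \theta}, \quad\text{with } \theta - \max_i x_i \sim \mathrm{Exp}(1). \]
Equating,
\[ 1 - e^{-(a - \max_i x_i)} = \frac{\gamma}{\gamma + \delta}, \]
we obtain that
\[ a^* = \max_i x_i + \ln\left(1 + \frac{\gamma}{\delta}\right). \]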

Question 5.11 Let $Y_1,\ldots,Y_n$ be observations satisfying
\[ Y_i = \beta x_i + \epsilon_i, \quad 1 \le i \le n, \]
where $\epsilon_1,\ldots,\epsilon_n$ are IID RVs each with the $N(0, \sigma^2)$ distribution. Here
$x_1,\ldots,x_n$ are known but $\beta$ and $\sigma^2$ are unknown.

(i) Determine the maximum-likelihood estimates $(\hat\beta, \hat\sigma^2)$ of $(\beta, \sigma^2)$.

(ii) Find the distribution of $\hat\beta$.

(iii) By showing that $Y_i - \hat\beta x_i$ and $\hat\beta$ are independent, or otherwise,
determine the joint distribution of $\hat\beta$ and $\hat\sigma^2$.

(iv) Explain carefully how you would test the hypothesis $H_0$: $\beta = \beta_0$
against $H_1$: $\beta \ne \beta_0$.


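Solution (i) The likelihood and log-likelihood in question are
\[ L = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_i (Y_i - \beta x_i)^2\right] \]
and
\[ \ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (Y_i - \beta x_i)^2. \]
The derivatives
\[ \frac{\partial}{\partial\beta}\ln L = \frac{1}{\sigma^2}\sum_i x_i(Y_i - \beta x_i) \quad\text{and}\quad
\frac{\partial}{\partial\sigma^2}\ln L = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i (Y_i - \beta x_i)^2 \]
vanish at the MLEs
\[ \hat\beta(\mathbf{x}, \mathbf{Y}) = \frac{\sum_i x_iY_i}{S_{xx}} \quad\text{and}\quad \hat\sigma^2(\mathbf{x}, \mathbf{Y}) = \frac{1}{n}\sum_i (Y_i - \hat\beta x_i)^2, \]
where $S_{xx} = \sum_i x_i^2$.
(ii) The RVs $Y_i \sim N(\beta x_i, \sigma^2)$, independently, thus $x_iY_i \sim N(\beta x_i^2, \sigma^2x_i^2)$,
again independently. Hence,
\[ \sum_i x_iY_i \sim N(\beta S_{xx}, \sigma^2S_{xx}) \quad\text{and}\quad \hat\beta(\mathbf{x}, \mathbf{Y}) \sim N(\beta, \sigma^2/S_{xx}). \]
(iii) The RVs $\hat\beta(\mathbf{x}, \mathbf{Y})$ and $Y_i - x_i\hat\beta(\mathbf{x}, \mathbf{Y})$, $1 \le i \le n$, are jointly normal.
Thus, $\hat\beta$ is independent of $\mathbf{Y} - \hat\beta\mathbf{x}$ if the covariances of $\hat\beta$ and $Y_i - x_i\hat\beta$ vanish
for all $1 \le i \le n$. But
\[ \mathrm{Cov}(Y_i - x_i\hat\beta, \hat\beta) = \mathrm{Cov}(Y_i, \hat\beta) - x_i\,\mathrm{Var}\,\hat\beta
= \frac{x_i\sigma^2}{S_{xx}} - \frac{x_i\sigma^2}{S_{xx}} = 0. \]
Also, the ratio $\sum_{1 \le i \le n}(Y_i - \beta x_i)^2/\sigma^2 \sim \chi^2_n$. On the other hand,
\[ \sum_{1 \le i \le n}(Y_i - \beta x_i)^2 = \sum_{1 \le i \le n}\left(Y_i - \hat\beta x_i + x_i(\hat\beta - \beta)\right)^2 = \sum_{1 \le i \le n}(Y_i - \hat\beta x_i)^2 + S_{xx}(\hat\beta - \beta)^2, \]
where the summands $\sum_{1 \le i \le n}(Y_i - \hat\beta x_i)^2$ and $S_{xx}(\hat\beta - \beta)^2$ are independent, and
$S_{xx}(\hat\beta - \beta)^2/\sigma^2 \sim \chi^2_1$. Hence,
\[ n\hat\sigma^2/\sigma^2 = \sum_{1 \le i \le n}(Y_i - \hat\beta x_i)^2/\sigma^2 \sim \chi^2_{n-1}, \]
independently of $\hat\beta$.
(iv) To test $H_0$: $\beta = \beta_0$ against $H_1$: $\beta \ne \beta_0$, observe that under $H_0$:
\[ \frac{\sqrt{S_{xx}}}{\sigma}(\hat\beta - \beta_0) \sim N(0,1), \quad\text{and}\quad \frac{n\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-1}, \quad\text{independently.} \]
Therefore, the statistic $T(\mathbf{x}, \mathbf{Y}) \sim t_{n-1}$ where
\[ T = \frac{\sqrt{S_{xx}}\,(\hat\beta - \beta_0)}{\sqrt{n\hat\sigma^2/(n-1)}}. \]
Thus, in a size $\alpha$ test we reject $H_0$ when $|T|$ exceeds the $\alpha/2$-tail value of
the Student $t_{n-1}$-distribution.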


Question 5.12 Let $X_1,\ldots,X_6$ be a sample from a uniform distribution on
$(0, \theta]$, where $\theta \in [1, 2]$ is an unknown parameter. Find an unbiased estimator
for $\theta$ of variance less than 1/10.


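Solution Set
\[ M = \max[X_1,\ldots,X_6]. \]
Then $M$ is a sufficient statistic for $\theta$, and we can use it to get an unbiased
estimator for $\theta$. We have
\[ F_M(y; \theta) = P_\theta(M < y) = P_\theta(X_1 < y,\ldots,X_6 < y)
= \prod_{i=1}^6 P_\theta(X_i < y) = (F_X(y; \theta))^6 = \left(\frac{y}{\theta}\right)^6, \quad 0 \le y \le \theta. \]
Then the PDF of $M$ is
\[ f_M(y; \theta) = \frac{d}{dy}F_M(y; \theta) = \frac{6y^5}{\theta^6}, \quad 0 \le y \le \theta, \]
and the mean value equals
\[ EM = \int_0^\theta y\,\frac{6y^5}{\theta^6}\,dy = \frac{6y^7}{7\theta^6}\Big|_0^\theta = \frac{6\theta}{7}. \]
So, an unbiased estimator for $\theta$ is $7M/6$. Next,
\[ \mathrm{Var}\,\frac{7M}{6} = E\left(\frac{7M}{6}\right)^2 - \left(E\,\frac{7M}{6}\right)^2 = \int_0^\theta \frac{7^2 \times 6y^7}{6^2\theta^6}\,dy - \theta^2 = \frac{7^2\theta^2}{8 \times 6} - \theta^2 = \frac{\theta^2}{48}. \]
For $\theta \in [1, 2]$,
\[ \frac{\theta^2}{48} \in \left[\frac{1}{48}, \frac{1}{12}\right], \]
i.e. $\theta^2/48 < 1/10$, as required. Hence the answer:
\[ \text{the required estimator} = \frac{7}{6} \times \max[X_1,\ldots,X_6]. \]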
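The moment calculations above can be verified in exact arithmetic. The sketch below relies on the identity $E M^k = \frac{n}{n+k}\theta^k$ for the maximum of $n$ uniform RVs on $(0, \theta]$, which follows from the PDF of $M$; the helper name `max_moment` is an illustrative assumption.

```python
from fractions import Fraction


def max_moment(k, n=6):
    """E[M^k] / theta^k for M the maximum of n IID uniform RVs on (0, theta]."""
    return Fraction(n, n + k)


n = 6
scale = Fraction(n + 1, n)                      # the factor 7/6

mean_coeff = scale * max_moment(1, n)           # E[7M/6] / theta, should equal 1
var_coeff = scale**2 * max_moment(2, n) - mean_coeff**2   # Var[7M/6] / theta^2

worst_case = var_coeff * 2**2                   # largest variance, attained at theta = 2

print(mean_coeff, var_coeff, worst_case)
```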


Question 5.13 Let $X_1,\ldots,X_n$ be a sample from a uniform distribution
on $[0, \theta]$, where $\theta \in (0, \infty)$ is an unknown parameter.

(i) Find a one-dimensional sufficient statistic $M$ for $\theta$ and construct a
95 percent confidence interval (CI) for $\theta$ based on $M$.

(ii) Suppose now that $\theta$ is an RV having prior density
\[ \pi(\theta) \propto I(\theta \ge a)\,\theta^{-k}, \]
where $a > 0$ and $k > 2$. Compute the posterior density for $\theta$ and find the
optimal Bayes estimator $\hat\theta$ under the quadratic loss function $(\theta - \hat\theta)^2$.
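Solution (i) Set
\[ M = \max[X_1,\ldots,X_n]. \]
Then $M$ is a sufficient statistic for $\theta$. In fact, the likelihood is
\[ f(\mathbf{x}; \theta) = \begin{cases} \dfrac{1}{\theta^n}, & \theta \ge x_{\max}, \\ 0, & \text{otherwise.} \end{cases} \]
So, $M$ is sufficient by the factorisation criterion: we can calculate $f(\,\cdot\,; \theta)$ if
$\theta$ and $x_{\max}$ are given.
Now, $P_\theta(\theta \ge M) = 1$. So, we can construct the CI with the lower bound $M$
and an upper bound $b(M)$ such that $P(\theta \le b(M)) = 0.95$. Write
\[ P(\theta \le b(M)) = 1 - P(b(M) < \theta) = 1 - P(M < b^{-1}(\theta)) = 1 - 0.05; \]
as the PDF is
\[ f_M(t; \theta) = \frac{nt^{n-1}}{\theta^n}\,I(0 < t < \theta), \]
we obtain
\[ \frac{(b^{-1}(\theta))^n}{\theta^n} = 0.05, \quad\text{whence } b^{-1}(\theta) = \theta(0.05)^{1/n}. \]
Then $M = \theta(0.05)^{1/n}$ gives $\theta = M/(0.05)^{1/n}$. Hence, the 95 percent CI for
$\theta$ is $(M, M/(0.05)^{1/n})$.
(ii) The Bayes formula
\[ \pi(\theta \mid \mathbf{x}) = \frac{f(\mathbf{x}; \theta)\,\pi(\theta)}{\int f(\mathbf{x}; \phi)\,\pi(\phi)\,d\phi} \]
yields for the posterior PDF:
\[ \pi(\theta \mid \mathbf{x}) \propto f(\mathbf{x}; \theta)\,\pi(\theta) \propto \frac{1}{\theta^{n+k}}\,I(\theta \ge c), \]
where
\[ c = c(\mathbf{x}) = \max[a, x_{\max}]. \]
Thus,
\[ \pi(\theta \mid \mathbf{x}) = (n+k-1)\,c^{n+k-1}\,\theta^{-n-k}\,I(\theta \ge c). \]
Next, with the quadratic LF, we have to minimise
\[ \int \pi(\theta \mid \mathbf{x})(\theta - \hat\theta)^2\,d\theta. \]
This gives the posterior mean
\[ \hat\theta^* = \arg\min_{\hat\theta}\left[\int \pi(\theta \mid \mathbf{x})(\theta - \hat\theta)^2\,d\theta\right] = \int \theta\,\pi(\theta \mid \mathbf{x})\,d\theta. \]
Now
\[ \int_c^\infty \theta^{-n-k+1}\,d\theta = \frac{1}{-n-k+2}\,\theta^{-n-k+2}\Big|_c^\infty = \frac{1}{n+k-2}\,c^{2-n-k}, \]
and after normalising, we obtain
\[ \hat\theta^* = \frac{k+n-1}{k+n-2}\,c = \frac{k+n-1}{k+n-2}\,\max[a, x_1,\ldots,x_n]. \]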

Question 5.14 Write a short account of the standard procedure used by 
statisticians for hypothesis testing. Your account should explain, in partic- 
ular, why the null hypothesis is considered differently from the alternative 
and also say what is meant by a likelihood ratio test. 


Solution Suppose we have data
\[ \mathbf{x} = (x_1,\ldots,x_n)^{\rm T} \]
from a PDF/PMF $f$. We make two mutually exclusive hypotheses about $f$,
$H_0$ (a null hypothesis) and $H_1$ (an alternative).

These hypotheses have different statuses. $H_0$ is treated as a conservative
hypothesis, not to be rejected unless there is clear evidence against it, e.g.:

(i) $H_0$: $f = f_0$ against $H_1$: $f = f_1$; both $f_0$, $f_1$ specified. (This case is
covered by the NP Lemma.)

(ii) $H_0$: $f = f_0$ (a specified PDF/PMF) against $H_1$: $f$ unrestricted. (This
includes the Pearson Theorem leading to $\chi^2$ tests.)

(iii) $f = f(\,\cdot\,; \theta)$ is determined by the value of a parameter; $H_0$: $\theta \in \Theta_0$
against $H_1$: $\theta \in \Theta_1$, where $\Theta_0 \cap \Theta_1 = \emptyset$ (e.g. families with a monotone
likelihood ratio).

(iv) $H_0$: $\theta \in \Theta_0$ against $H_1$: $\theta \in \Theta$, where $\Theta_0 \subset \Theta$, and $\Theta$ has more degrees
of freedom than $\Theta_0$.

A test is specified by a critical region $C$ such that if $\mathbf{x} \in C$, then $H_0$ is
rejected, while if $\mathbf{x} \notin C$, $H_0$ is not rejected (which highlights the conservative
nature of $H_0$). A type I error occurs when $H_0$ is rejected while being true.
Further, the type I error probability is defined as $P(C)$ under $H_0$; we say that
a test has size $\alpha$ (or $\le \alpha$), if $\max_{H_0} P(C) \le \alpha$. We choose $\alpha$ at our discretion
(e.g. 0.1, 0.01, etc.) establishing an accepted chance of rejecting $H_0$ wrongly.
Then we look for a test of a given size $\alpha$ which minimises the type II error
probability $1 - P(C)$, i.e. maximises the power $P(C)$ under $H_1$.

To define an appropriate critical region, one considers the likelihood ratio
\[ \frac{\max f(\mathbf{x} \mid H_1)}{\max f(\mathbf{x} \mid H_0)}, \]
where the maxima are taken over PDFs/PMFs representing $H_0$ and $H_1$,
respectively. The critical region is then defined as the set of data samples $\mathbf{x}$
where the likelihood ratio is large, depending on the given size $\alpha$.


Question 5.15 State and prove the NP Lemma. Explain what is meant 
by a uniformly most powerful test. 

Let $X_1,\ldots,X_n$ be a sample from the normal distribution of mean $\theta$ and
variance 1, where $\theta \in \mathbb{R}$ is an unknown parameter. Find a UMP test of size
1/100 for
\[ H_0:\ \theta \le 0, \qquad H_1:\ \theta > 0, \]


expressing your answer in terms of an appropriate distribution function. 
Justify carefully that your test is uniformly most powerful of size 1/100. 


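Solution The NP Lemma is applicable when both the null hypothesis and
the alternative are simple, i.e. $H_0$: $f = f_0$, $H_1$: $f = f_1$, where $f_1$ and $f_0$ are
two PDFs/PMFs defined on the same region. The NP Lemma states: for all
$k > 0$, the test with critical region $C = \{\mathbf{x}: f_1(\mathbf{x}) > kf_0(\mathbf{x})\}$ has the highest
power $P_1(C)$ among all tests (i.e. critical regions) of size $P_0(C)$.

For the proof of the NP Lemma: see Section 4.2.

The UMP test of size $\alpha$ for $H_0$: $\theta \in \Theta_0$ against $H_1$: $\theta \in \Theta_1$ has the
critical region $C$ such that: (i) $\max[P_\theta(C): \theta \in \Theta_0] \le \alpha$, and (ii) for all $C^*$
with $\max[P_\theta(C^*): \theta \in \Theta_0] \le \alpha$: $P_\theta(C) \ge P_\theta(C^*)$ for all $\theta \in \Theta_1$.

In the example where $X_i \sim N(\theta, 1)$, $H_0$: $\theta \le 0$ and $H_1$: $\theta > 0$, we fix some
$\theta_1 > 0$ and consider the simple null hypothesis that $\theta = 0$ against the simple
alternative that $\theta = \theta_1$. The log-likelihood ratio
\[ \ln\frac{f(\mathbf{x}; \theta_1)}{f(\mathbf{x}; 0)} = \theta_1\sum_i x_i - \frac{n}{2}\theta_1^2 \]
is large when $\sum_i x_i$ is large. Choose $k_1 > 0$ so that
\[ \frac{1}{100} = P_0\left(\sum_i X_i > k_1\right) = P_0\left(\frac{\sum_i X_i}{\sqrt n} > \frac{k_1}{\sqrt n}\right) = 1 - \Phi\left(\frac{k_1}{\sqrt n}\right), \]
i.e. $k_1/\sqrt n = z_+(0.01) = \Phi^{-1}(0.99)$. Then
\[ P_\theta\left(\sum_i X_i > k_1\right) < \frac{1}{100} \]
for all $\theta < 0$. Thus, the test with the critical region
\[ C = \left\{\mathbf{x}:\ \sum_i x_i > k_1\right\} \]
has size 0.01 for $H_0$: $\theta \le 0$.
Now, for all $\theta' > 0$, $C$ can be written as
\[ C = \left\{\mathbf{x}:\ \frac{f(\mathbf{x}; \theta')}{f(\mathbf{x}; 0)} > k'\right\} \]
with some $k' = k'(\theta') > 0$. By the NP Lemma, $P_{\theta'}(C^*) \le P_{\theta'}(C)$ for all $C^*$
with $P_0(C^*) \le 0.01$. Similarly, for all $\theta' > 0$,
\[ P_{\theta'}(C^*) \le P_{\theta'}(C) \quad\text{for all } C^* \text{ such that } P_\theta(C^*) \le \frac{1}{100} \text{ for all } \theta \le 0. \]
So, $C = \{\mathbf{x}: \sum_i x_i > k_1\}$ is size 0.01 UMP for $H_0$ against $H_1$.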
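The threshold $k_1 = \sqrt{n}\,\Phi^{-1}(0.99)$ is straightforward to evaluate with the Python standard library. This is a sketch under stated assumptions: the function names `ump_threshold` and `rejects` and the sample values are illustrative, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist


def ump_threshold(n, size=0.01):
    """Threshold k1 such that the test rejects H0: theta <= 0 when sum(x) > k1."""
    return n ** 0.5 * NormalDist().inv_cdf(1.0 - size)


def rejects(xs, size=0.01):
    """Apply the UMP test of the given size to a sample from N(theta, 1)."""
    return sum(xs) > ump_threshold(len(xs), size)


# Hypothetical samples of size 4.
print(ump_threshold(4))
print(rejects([1.5, 1.5, 1.5, 1.5]), rejects([0.5, 0.5, 0.5, 0.5]))
```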


Question 5.16 Students of mathematics in a large university are given a 
percentage mark in their annual examination. In a sample of nine students 
the following marks were found: 


28 32 34 39 41 42 42 46 56.


Students of history also receive a percentage mark. A sample of five students 
reveals the following marks: 


53 58 60 61 68. 


Do these data support the hypothesis that the marks for mathematics are 
more variable than the marks for history? Quantify your conclusion. Com- 
ment on your modelling assumptions. 


Distribution       N(0,1)   $F_{9,5}$   $F_{8,4}$   $\chi^2_{14}$   $\chi^2_{13}$   $\chi^2_{12}$
95th percentile    1.65     4.78    6.04    23.7    22.4    21.0
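Solution Take independent RVs
\[ X_i \sim N(\mu_1, \sigma_1^2), \quad i = 1,\ldots,9, \]
and
\[ Y_j \sim N(\mu_2, \sigma_2^2), \quad j = 1,\ldots,5. \]
If $\sigma_1^2 = \sigma_2^2$, then
\[ F = \frac18\sum_{i=1}^9 (X_i - \overline X)^2 \Big/ \frac14\sum_{j=1}^5 (Y_j - \overline Y)^2 \sim F_{8,4}, \]
where
\[ \overline X = \frac19\sum_i X_i, \qquad \overline Y = \frac15\sum_j Y_j. \]
We have $\overline X = 40$ and the values shown in the following table, with $\sum_i (X_i - \overline X)^2 = 546$.

$X_i$                       28    32    34    39    41    42    42    46    56
$X_i - \overline X$         -12    -8    -6    -1     1     2     2     6    16
$(X_i - \overline X)^2$     144    64    36     1     1     4     4    36   256

Similarly, $\overline Y = 60$, and we have the values shown in the table below, with
$\sum_j (Y_j - \overline Y)^2 = 118$.

$Y_j$                       53    58    60    61    68
$Y_j - \overline Y$         -7    -2     0     1     8
$(Y_j - \overline Y)^2$     49     4     0     1    64

Then
\[ F = \frac18 \times 546 \Big/ \frac14 \times 118 \approx 2.31. \]
But the 95th percentile of $F_{8,4}$ is 6.04. So, we have no evidence to reject $H_0$: $\sigma_1^2 = \sigma_2^2$ at the
95 percent level, i.e. we do not accept that $\sigma_1^2 > \sigma_2^2$.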
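The sums of squares and the variance ratio are quick to recompute from the raw marks. The sketch below uses illustrative variable names; the critical value 6.04 is the tabulated 95th percentile of $F_{8,4}$ from the hint, not a computed quantity.

```python
# Marks from Question 5.16.
maths = [28, 32, 34, 39, 41, 42, 42, 46, 56]
history = [53, 58, 60, 61, 68]


def sum_sq_dev(values):
    """Sum of squared deviations from the sample mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


s_xx = sum_sq_dev(maths)       # expected 546
s_yy = sum_sq_dev(history)     # expected 118

f_stat = (s_xx / (len(maths) - 1)) / (s_yy / (len(history) - 1))

f_crit = 6.04                  # 95th percentile of F(8, 4), from the hint
reject_h0 = f_stat > f_crit

print(s_xx, s_yy, f_stat, reject_h0)
```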


Question 5.17 Consider the linear regression model
\[ Y_i = \alpha + \beta x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1,\ldots,n, \]
where $x_1,\ldots,x_n$ are known, with $\sum_{i=1}^n x_i = 0$, and where $\alpha, \beta \in \mathbb{R}$ and
$\sigma^2 \in (0, \infty)$ are unknown. Find the maximum likelihood estimators $\hat\alpha$, $\hat\beta$, $\hat\sigma^2$
and write down their distributions.

Consider the following data.

$x_i$    -3   -2   -1    0    1    2    3
$Y_i$    -5    0    3    4    3    0   -5

Fit the linear regression model and comment on its appropriateness. 
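Solution RV $Y_i$ has the PDF
\[ f_{Y_i}(y; \alpha, \beta, \sigma^2) = (2\pi\sigma^2)^{-1/2}e^{-(y - \alpha - \beta x_i)^2/(2\sigma^2)}, \]
with
\[ \ln f_{\mathbf Y}(\mathbf{y}; \alpha, \beta, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \sum_i (y_i - \alpha - \beta x_i)^2/(2\sigma^2). \]
To find the MLEs $\hat\alpha$ and $\hat\beta$, consider the stationary points:
\[ 0 = \frac{\partial}{\partial\alpha}\sum_i (Y_i - \alpha - \beta x_i)^2 = -2n(\overline Y - \alpha), \quad\text{whence } \hat\alpha = \overline Y, \]
\[ 0 = \frac{\partial}{\partial\beta}\sum_i (Y_i - \alpha - \beta x_i)^2 = -2(S_{xY} - \beta S_{xx}), \quad\text{whence } \hat\beta = \frac{S_{xY}}{S_{xx}}, \]
where $\overline Y = \sum_i Y_i/n$, $S_{xY} = \sum_i x_iY_i$ and $S_{xx} = \sum_i x_i^2$. The fact that they
give the global maximum follows from the uniqueness of the stationary point
and the fact that $f_{\mathbf Y}(\mathbf{y}; \alpha, \beta, \sigma^2) \to 0$ as any of $\alpha$, $\beta$ and $\sigma^2 \to \infty$ or $\sigma^2 \to 0$.
Set $R = \sum_i (Y_i - \hat\alpha - \hat\beta x_i)^2$, then at $\hat\sigma^2$:
\[ 0 = \frac{\partial}{\partial\sigma^2}\left(-\frac{n}{2}\ln\sigma^2 - \frac{R}{2\sigma^2}\right) = -\frac{n}{2\sigma^2} + \frac{R}{2\sigma^4}, \quad\text{whence } \hat\sigma^2 = \frac{R}{n}. \]
The distributions are
\[ \hat\alpha \sim N\left(\alpha, \frac{\sigma^2}{n}\right), \quad \hat\beta \sim N\left(\beta, \frac{\sigma^2}{S_{xx}}\right) \quad\text{and}\quad R/\sigma^2 \sim \chi^2_{n-2}. \]
In the example, $\overline Y = 0$ and $S_{xY} = 0$, hence $\hat\alpha = \hat\beta = 0$. Further, $R = 84$,
i.e. $\hat\sigma^2 = R/n = 12$.

This model is not particularly good as the data for $(x_i, Y_i)$ show a
parabolic shape, not linear. See Figure 5.1.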
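The fit for the seven data points can be reproduced as follows. This is a minimal sketch with illustrative variable names; it also prints the residuals, whose pattern shows the parabolic shape mentioned in the solution.

```python
# Data from Question 5.17 (the x values sum to zero, as the model requires).
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [-5, 0, 3, 4, 3, 0, -5]

n = len(xs)
y_bar = sum(ys) / n
s_xy = sum(x * y for x, y in zip(xs, ys))
s_xx = sum(x * x for x in xs)

alpha_hat = y_bar              # MLE of alpha
beta_hat = s_xy / s_xx         # MLE of beta

residuals = [y - alpha_hat - beta_hat * x for x, y in zip(xs, ys)]
r = sum(e * e for e in residuals)      # residual sum of squares R
sigma2_hat = r / n                     # MLE of sigma^2

print(alpha_hat, beta_hat, r, sigma2_hat)
print(residuals)   # negative at both ends, positive in the middle: a curved pattern
```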


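Question 5.18 The independent observations $X_1$, $X_2$ are distributed as
Poisson RVs, with means $\mu_1$, $\mu_2$ respectively, where
\[ \ln\mu_1 = \alpha, \qquad \ln\mu_2 = \alpha + \beta, \]
with $\alpha$ and $\beta$ unknown parameters. Write down $\ell(\alpha, \beta)$, the log-likelihood
function, and hence find the following:
(i) $\dfrac{\partial^2\ell}{\partial\alpha^2}$, $\dfrac{\partial^2\ell}{\partial\alpha\,\partial\beta}$, $\dfrac{\partial^2\ell}{\partial\beta^2}$;

(ii) $\hat\beta$, the maximum likelihood estimator of $\beta$.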

Figure 5.1 (plot of the data points, with $x_i$ on the horizontal axis and $Y_i$ on the vertical axis)


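Solution We have, for all integer $x_1, x_2 \ge 0$,
\[ \ell(x_1, x_2; \alpha, \beta) = \ln\left(e^{-\mu_1}\frac{\mu_1^{x_1}}{x_1!}\,e^{-\mu_2}\frac{\mu_2^{x_2}}{x_2!}\right)
= -e^{\alpha} + x_1\alpha - e^{\alpha+\beta} + x_2(\alpha + \beta) - \ln(x_1!\,x_2!). \]
So, (i)
\[ \frac{\partial^2\ell}{\partial\alpha^2} = -e^{\alpha}(1 + e^{\beta}), \qquad \frac{\partial^2\ell}{\partial\alpha\,\partial\beta} = -e^{\alpha+\beta}, \qquad \frac{\partial^2\ell}{\partial\beta^2} = -e^{\alpha+\beta}. \]
(ii) Consider the stationary point
\[ \frac{\partial\ell}{\partial\alpha} = 0 \ \Rightarrow\ x_1 + x_2 = e^{\hat\alpha} + e^{\hat\alpha+\hat\beta}, \qquad \frac{\partial\ell}{\partial\beta} = 0 \ \Rightarrow\ x_2 = e^{\hat\alpha+\hat\beta}. \]
Hence,
\[ \hat\alpha = \ln x_1, \qquad \hat\beta = \ln\frac{x_2}{x_1}. \]
The stationary point gives the global maximum, as it is unique and $\ell \to -\infty$
as $|\alpha|, |\beta| \to \infty$.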


Question 5.19 The lifetime of certain electronic components may be assumed
to follow the exponential PDF
\[ f(t; \theta) = \frac{1}{\theta}\exp\left(-\frac{t}{\theta}\right), \quad\text{for } t \ge 0, \]
where $t$ is the sample value of $T$.
Let $t_1,\ldots,t_n$ be a random sample from this PDF. Quoting carefully the
NP Lemma, find the form of the most powerful test of size 0.05 of
\[ H_0:\ \theta = \theta_0 \quad\text{against}\quad H_1:\ \theta = \theta_1, \]
where $\theta_0$ and $\theta_1$ are given, and $\theta_0 < \theta_1$. Defining the function
\[ G_n(u) = \int_0^u \frac{t^{n-1}e^{-t}}{(n-1)!}\,dt, \]
show that this test has power
\[ 1 - G_n\left[\frac{\theta_0}{\theta_1}\,G_n^{-1}(1 - \alpha)\right], \]
where $\alpha = 0.05$.
If for $n = 100$ you observed $\sum_i t_i/n = 3.1$, would you accept the hypothesis
$H_0$: $\theta_0 = 2$? Give reasons for your answer, using the large-sample
distribution of $(T_1 + \cdots + T_n)/n$.


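Solution The likelihood function for a sample vector $\mathbf{t} \in \mathbb{R}^n$ is
\[ f_\theta(\mathbf{t}) = \frac{1}{\theta^n}\exp\left(-\frac{1}{\theta}\sum_i t_i\right)I\left(\min_i t_i > 0\right), \qquad \mathbf{t} = (t_1,\ldots,t_n)^{\rm T}. \]
By the NP Lemma, the MP test of size $\alpha$ will be with the critical region
\[ C = \left\{\mathbf{t}:\ \frac{f_{\theta_1}(\mathbf{t})}{f_{\theta_0}(\mathbf{t})} > k\right\} \]
such that
\[ \int_C f_{\theta_0}(\mathbf{t})\,d\mathbf{t} = 0.05. \]
As
\[ \frac{f_{\theta_1}(\mathbf{t})}{f_{\theta_0}(\mathbf{t})} = \left(\frac{\theta_0}{\theta_1}\right)^n\exp\left[\left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right)\sum_i t_i\right] \quad\text{and}\quad \frac{1}{\theta_0} > \frac{1}{\theta_1}, \]
$C$ has the form
\[ \left\{\mathbf{t}:\ \sum_i t_i > c\right\} \]
for some $c > 0$. Under $H_0$,
\[ X = \frac{1}{\theta_0}\sum_i T_i \sim \mathrm{Gam}(n, 1), \quad\text{with } P_{\theta_0}(X < u) = G_n(u). \]
Hence, to obtain the MP test of size 0.05, we choose $c$ so that
\[ 1 - G_n\left(\frac{c}{\theta_0}\right) = 0.05, \quad\text{i.e. } \frac{c}{\theta_0} = G_n^{-1}(0.95). \]
Then the power of the test is
\[ P_{\theta_1}\left(\sum_i T_i > c\right) = 1 - G_n\left(\frac{c}{\theta_1}\right), \]
which equals
\[ 1 - G_n\left(\frac{\theta_0}{\theta_1}\,G_n^{-1}(0.95)\right), \]
as required.
As $ET_i = \theta$ and $\mathrm{Var}\,T_i = \theta^2$, for $n$ large:
\[ \frac{\sum_i T_i - n\theta}{\theta\sqrt n} \sim N(0,1), \quad\text{approximately,} \]
by the CLT. Under $H_0$: $\theta_0 = 2$,
\[ \frac{\sum_i T_i - n\theta_0}{\theta_0\sqrt n} = \frac{310 - 200}{2 \times 10} = 5.5. \]
On the other hand, $z_+(0.05) = 1.645$. As $5.5 > 1.645$, we reject $H_0$.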
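The large-sample calculation at the end of this solution amounts to one standardisation. The sketch below reproduces it; the variable names are illustrative, and the critical value 1.645 is the one quoted in the solution.

```python
import math

n = 100
sample_mean = 3.1      # observed value of (t_1 + ... + t_n) / n
theta_0 = 2.0          # hypothesised mean lifetime under H0

# Under H0, sum(T_i) has mean n*theta_0 and standard deviation theta_0*sqrt(n).
z = (n * sample_mean - n * theta_0) / (theta_0 * math.sqrt(n))

z_crit = 1.645         # upper 5 percent point of N(0, 1)
reject_h0 = z > z_crit

print(z, reject_h0)
```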


Question 5.20 Consider the model
\[ y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + \epsilon_i, \quad\text{for } 1 \le i \le n, \]
where $x_1,\ldots,x_n$ are given values, with $\sum_i x_i = 0$, and where $\epsilon_1,\ldots,\epsilon_n$ are
independent normal errors, each with zero mean and unknown variance $\sigma^2$.

(i) Obtain equations for $(\hat\beta_0, \hat\beta_1, \hat\beta_2)$, the MLE of $(\beta_0, \beta_1, \beta_2)$. Do not
attempt to solve these equations.
(ii) Obtain an expression for $\hat\beta_1^*$, the MLE of $\beta_1$ in the reduced model
\[ H_0:\ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \quad 1 \le i \le n, \]
with $\sum_i x_i = 0$, and $\epsilon_1,\ldots,\epsilon_n$ distributed as above.


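Solution (i) As $Y_i \sim N(\beta_0 + \beta_1 x_i + \beta_2 x_i^2, \sigma^2)$, independently, the likelihood is

$$f(\mathbf{y} \mid \beta_0, \beta_1, \beta_2, \sigma^2) = \prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}\left(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2\right)^2\right]$$

$$= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2\right)^2\right].$$

We then obtain the equations for the stationary points of the LL:

$$\frac{\partial \ell}{\partial \beta_0} = \frac{1}{\sigma^2}\sum_i \left(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2\right) = 0,$$

$$\frac{\partial \ell}{\partial \beta_1} = \frac{1}{\sigma^2}\sum_i x_i\left(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2\right) = 0,$$

$$\frac{\partial \ell}{\partial \beta_2} = \frac{1}{\sigma^2}\sum_i x_i^2\left(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2\right) = 0.$$

In principle, $\sigma^2$ should also figure here as a parameter. The last system of equations still contains $\sigma^2$, but only as a common factor $1/\sigma^2$ that cancels, so luckily the equation $\partial\ell/\partial\sigma^2 = 0$ can be dropped.

(ii) The same is true for the reduced model. The answer

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}},$$

with $S_{xy} = \sum_i x_i y_i$ and $S_{xx} = \sum_i x_i^2$, is correct. Again, $\sigma^2$ will not appear in the expression for $\hat\beta_1$.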
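
As an illustration of part (ii), the sketch below computes the reduced-model estimates for centred design points and checks that both score equations vanish there. The data and the helper name `reduced_model_mle` are assumptions made for this example, not part of the original question.

```python
def reduced_model_mle(x, y):
    """MLE of (beta0, beta1) in y_i = beta0 + beta1*x_i + eps_i, assuming sum(x) == 0."""
    n = len(x)
    assert abs(sum(x)) < 1e-12, "the design points must be centred"
    s_xx = sum(xi * xi for xi in x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    beta0_hat = sum(y) / n        # from the equation d(ell)/d(beta0) = 0
    beta1_hat = s_xy / s_xx       # from the equation d(ell)/d(beta1) = 0
    return beta0_hat, beta1_hat


# Hypothetical data lying exactly on the line y = 1 + 3x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-5.0, -2.0, 1.0, 4.0, 7.0]
b0, b1 = reduced_model_mle(x, y)

# Both score equations (up to the factor 1/sigma^2) should vanish at the MLE.
score0 = sum(yi - b0 - b1 * xi for xi, yi in zip(x, y))
score1 = sum(xi * (yi - b0 - b1 * xi) for xi, yi in zip(x, y))
print(b0, b1, score0, score1)
```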


Question 5.21 Let $(x_1, \ldots, x_n)$ be a random sample from the normal PDF with mean $\mu$ and variance $\sigma^2$.

(i) Write down the log-likelihood function $\ell(\mu, \sigma^2)$.
(ii) Find a pair of sufficient statistics for the unknown parameters $(\mu, \sigma^2)$, carefully quoting the relevant theorem.
(iii) Find $(\hat\mu, \hat\sigma^2)$, the MLEs of $(\mu, \sigma^2)$. Quoting carefully any standard distributional results required, show how to construct a 95 percent CI for $\mu$.


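Solution (i) The LL is

$$\ell(\mathbf{x}; \mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$

(ii) By the factorisation criterion, $T(\mathbf{x})$ is sufficient for $(\mu, \sigma^2)$ iff

$$\ell(\mathbf{x}; \mu, \sigma^2) = \ln g(T(\mathbf{x}), \mu, \sigma^2) + \ln h(\mathbf{x})$$

for some functions $g$ and $h$. Now

$$\ell(\mathbf{x}; \mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i \left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2.$$

The remaining calculations affect the sum $\sum_i$ only:

$$\sum_i \left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2 = \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2,$$

as $\sum_{i=1}^n (x_i - \bar{x}) = 0$.
Thus, with $T(\mathbf{X}) = \left(\bar{X}, \sum_i (X_i - \bar{X})^2\right)$, we satisfy the factorisation criterion (with $h \equiv 1$). So, $T(\mathbf{X}) = \left(\bar{X}, \sum_i (X_i - \bar{X})^2\right)$ is sufficient for $(\mu, \sigma^2)$.
(iii) The MLEs for $(\mu, \sigma^2)$ are found from

$$\frac{\partial \ell}{\partial \mu} = \frac{\partial \ell}{\partial \sigma^2} = 0$$

and are given by

$$\hat\mu = \bar{x}, \quad \hat\sigma^2 = \frac{S_{xx}}{n}, \quad \text{where } S_{xx} = \sum_i (x_i - \bar{x})^2.$$

We know that

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{and} \quad \frac{S_{XX}}{\sigma^2} \sim \chi^2_{n-1},$$

independently. Then

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \bigg/ \frac{1}{\sigma}\sqrt{\frac{S_{XX}}{n-1}} \sim t_{n-1}.$$

So, if $t_{n-1}(0.025)$ is the upper point of $t_{n-1}$, then

$$\left(\bar{x} - \frac{1}{\sqrt{n}}\,s_{xx}\,t_{n-1}(0.025),\ \bar{x} + \frac{1}{\sqrt{n}}\,s_{xx}\,t_{n-1}(0.025)\right)$$

is the 95 percent CI for $\mu$. Here $s_{xx} = \sqrt{S_{xx}/(n-1)}$.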


Question 5.22 Suppose that, given the real parameter $\theta$, the observation $X$ is normally distributed with mean $\theta$ and variance $v$, where $v$ is known. If the prior density for $\theta$ is

$$\pi(\theta) \propto \exp\left[-(\theta - \mu_0)^2/2v_0\right],$$

where $\mu_0$ and $v_0$ are given, show that the posterior density for $\theta$ is $\pi(\theta \mid x)$, where

$$\pi(\theta \mid x) \propto \exp\left[-(\theta - \mu_1)^2/2v_1\right],$$

and $\mu_1$ and $v_1$ are given by

$$\mu_1 = \frac{\mu_0/v_0 + x/v}{1/v_0 + 1/v}, \quad \frac{1}{v_1} = \frac{1}{v_0} + \frac{1}{v}.$$

Sketch typical curves $\pi(\theta)$ and $\pi(\theta \mid x)$, with $\mu_0$ and $x$ marked on the $\theta$-axis.


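Solution We have

$$f(x; \theta) \propto \exp\left[-\frac{1}{2v}(x - \theta)^2\right], \quad -\infty < x < \infty.$$

Then

$$\pi(\theta \mid x) \propto f(x; \theta)\pi(\theta) \propto \exp\left[-\frac{1}{2v}(x - \theta)^2 - \frac{1}{2v_0}(\theta - \mu_0)^2\right].$$

Write

$$\frac{(x - \theta)^2}{v} + \frac{(\theta - \mu_0)^2}{v_0} = \theta^2\left(\frac{1}{v} + \frac{1}{v_0}\right) - 2\theta\left(\frac{x}{v} + \frac{\mu_0}{v_0}\right) + \frac{\mu_0^2}{v_0} + \frac{x^2}{v}$$

$$= \left(\frac{1}{v} + \frac{1}{v_0}\right)\left(\theta - \frac{x/v + \mu_0/v_0}{1/v_0 + 1/v}\right)^2 - \frac{(x/v + \mu_0/v_0)^2}{1/v_0 + 1/v} + \frac{\mu_0^2}{v_0} + \frac{x^2}{v}$$

$$= \frac{1}{v_1}(\theta - \mu_1)^2 + \text{terms not containing } \theta,$$

where $\mu_1$, $v_1$ are as required.
Thus

$$\pi(\theta \mid x) \propto \exp\left[-\frac{1}{2v_1}(\theta - \mu_1)^2 + \text{terms not containing } \theta\right] \propto \exp\left[-\frac{1}{2v_1}(\theta - \mu_1)^2\right].$$

Both PDFs $\pi(\theta)$ and $\pi(\,\cdot \mid x)$ are normal; as Figure 5.2 shows, the variance of $\pi$ is larger than that of $\pi(\,\cdot \mid x)$.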
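
The update formulas for $\mu_1$ and $v_1$ translate directly into code. This is a minimal sketch; the function name `normal_posterior` and the numerical inputs are illustrative assumptions.

```python
def normal_posterior(x, v, mu0, v0):
    """Posterior mean and variance for a N(theta, v) observation with a N(mu0, v0) prior."""
    precision = 1.0 / v0 + 1.0 / v          # 1/v1 = 1/v0 + 1/v
    mu1 = (mu0 / v0 + x / v) / precision    # precision-weighted average of mu0 and x
    v1 = 1.0 / precision
    return mu1, v1


# Hypothetical inputs: prior N(0, 4), one observation x = 3 with variance 1.
mu1, v1 = normal_posterior(x=3.0, v=1.0, mu0=0.0, v0=4.0)
print(mu1, v1)
```

The posterior mean lies between $\mu_0$ and $x$, and the posterior variance is smaller than both $v_0$ and $v$, in line with the remark about Figure 5.2.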


[Figure 5.2: sketch of the prior density $\pi(\theta)$ and the posterior density $\pi(\theta \mid x)$; $\mu_1 < \mu_0$ if $x < \mu_0$, and $\mu_1 > \mu_0$ if $x > \mu_0$.]


Question 5.23 Let $(n_{ij})$ be the observed frequencies for an $r \times c$ contingency table, let $n = \sum_{i=1}^r \sum_{j=1}^c n_{ij}$ and let

$$E(n_{ij}) = n p_{ij}, \quad 1 \le i \le r, \quad 1 \le j \le c,$$

thus $\sum_i \sum_j p_{ij} = 1$.
Under the usual assumption that $(n_{ij})$ is a multinomial sample, show that the likelihood ratio statistic for testing

$$H_0: p_{ij} = \alpha_i \beta_j,$$

for all $(i, j)$ and for some vectors $\alpha = (\alpha_1, \ldots, \alpha_r)$ and $\beta = (\beta_1, \ldots, \beta_c)$, is

$$D = 2\sum_{i=1}^r \sum_{j=1}^c n_{ij}\ln(n_{ij}/e_{ij}),$$

where you should define $(e_{ij})$. Show further that for $|n_{ij} - e_{ij}|$ small, the statistic $D$ may be approximated by

$$Z^2 = \sum_{i=1}^r \sum_{j=1}^c (n_{ij} - e_{ij})^2/e_{ij}.$$

In 1843 William Guy collected the data shown in the following table on 1659 outpatients at a particular hospital showing their physical exertion at work and whether they had pulmonary consumption (TB) or some other disease.

                          Disease type
    Level of exertion     pulmonary      other
    at work               consumption    disease
    Little                125            385
    Varied                 41            136
    More                  142            630
    Great                  33            167

For these data, $Z^2$ was found to be 9.84. What do you conclude? (Note that this question can be answered without calculators or statistical tables.)


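Solution With $n_{ij}$ standing for counts in the contingency table, set

$$n = \sum_{i=1}^r \sum_{j=1}^c n_{ij}, \quad E(n_{ij}) = n p_{ij}, \quad 1 \le i \le r, \quad 1 \le j \le c.$$

We have the constraint: $\sum_{i,j} p_{ij} = 1$. Further,

$$H_0: p_{ij} = \alpha_i \beta_j,\ \sum_i \alpha_i = \sum_j \beta_j = 1; \qquad H_1: p_{ij} \text{ arbitrary},\ \sum_{i,j} p_{ij} = 1.$$

Under $H_1$, in the multinomial distribution, the probability of observing a sample $(n_{ij})$ equals $n!\prod_{i,j} p_{ij}^{n_{ij}}/n_{ij}!$. The LL, as a function of arguments $p_{ij}$, is

$$\ell((p_{ij})) = \sum_{i,j} n_{ij}\ln p_{ij} + A,$$

where $A = \ln(n!) - \sum_{i,j}\ln(n_{ij}!)$ does not depend on $p_{ij}$. Hence, under the constraint $\sum_{i,j} p_{ij} = 1$, $\ell$ is maximised at

$$\hat{p}_{ij} = \frac{n_{ij}}{n}.$$

Under $H_0$, the LL equals

$$\ell((\alpha_i), (\beta_j)) = \sum_i n_{i+}\ln\alpha_i + \sum_j n_{+j}\ln\beta_j + B,$$

where $B$ does not depend on $\alpha_i$, $\beta_j$, and

$$n_{i+} = \sum_j n_{ij}, \quad n_{+j} = \sum_i n_{ij}.$$

Under the constraint $\sum_i \alpha_i = \sum_j \beta_j = 1$, $\ell$ is maximised at

$$\hat\alpha_i = \frac{n_{i+}}{n}, \quad \hat\beta_j = \frac{n_{+j}}{n}.$$

Thus,

$$\hat{p}_{ij} = \hat\alpha_i\hat\beta_j = \frac{e_{ij}}{n}, \quad \text{where } e_{ij} = \frac{n_{i+}n_{+j}}{n}.$$

Then the LR statistic

$$2\left\{\max\left[\ell((p_{ij})): \sum_{i,j} p_{ij} = 1\right] - \max\left[\ell((\alpha_i), (\beta_j)): \sum_i \alpha_i = \sum_j \beta_j = 1\right]\right\} = 2\sum_{i,j} n_{ij}\ln(n_{ij}/e_{ij})$$

coincides with $D$, as required.
With $n_{ij} - e_{ij} = \delta_{ij}$, so that $\sum_{i,j}\delta_{ij} = 0$, we have

$$D = 2\sum_{i,j}(e_{ij} + \delta_{ij})\ln\left(1 + \frac{\delta_{ij}}{e_{ij}}\right)$$

and, omitting the subscripts,

$$D \approx 2\sum (e + \delta)\left(\frac{\delta}{e} - \frac{\delta^2}{2e^2}\right) \approx 2\sum\left(\delta + \frac{\delta^2}{2e}\right) = \sum\frac{\delta^2}{e} = Z^2.$$

The data given yield $Z^2 = 9.84$. With $r = 4$, $c = 2$, we use $\chi^2_3$. The value 9.84 is too high for $\chi^2_3$, so we reject $H_0$. The conclusion is that incidence of TB rather than other diseases is reduced when the level of exertion increases.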
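
The quoted value $Z^2 = 9.84$ can be reproduced from Guy's table. The sketch below uses the definition $e_{ij} = n_{i+}n_{+j}/n$; the function name `pearson_z2` is a hypothetical helper introduced for this check.

```python
def pearson_z2(table):
    """Pearson statistic sum (n_ij - e_ij)^2 / e_ij with e_ij = n_i+ * n_+j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    z2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n
            z2 += (n_ij - e_ij) ** 2 / e_ij
    return z2, n


# Rows: Little, Varied, More, Great; columns: pulmonary consumption, other disease.
guy_table = [[125, 385], [41, 136], [142, 630], [33, 167]]
z2, n_total = pearson_z2(guy_table)
degrees_of_freedom = (len(guy_table) - 1) * (len(guy_table[0]) - 1)
print(round(z2, 2), n_total, degrees_of_freedom)
```

Since $\chi^2_3$ has mean 3 and variance 6, a value near 9.84 lies almost three standard deviations above the mean, which is why no tables are needed.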


Question 5.24 In a large group of young couples, the standard deviation
of the husbands’ ages is four years, and that of the wives’ ages is three years. 
Let D denote the age difference within a couple. 

Under what circumstances might you expect to find that the standard 
deviation of age differences in the group is about 5 years? 

Instead you find it to be 2 years. One possibility is that the discrepancy 
is the result of random variability. Give another explanation. 


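Solution Writing $D = H - W$, we see that if the ages $H$ and $W$ are independent, then

$$\operatorname{Var} D = \operatorname{Var} H + \operatorname{Var} W = 4^2 + 3^2 = 5^2,$$

and the standard deviation is 5. Otherwise we would expect $\operatorname{Var} D \ne 5^2$, so the value 5 is taken under independence.
An alternative explanation for the value 2 is that $H$ and $W$ are correlated.

If $H = \alpha W + \varepsilon$ with $W$ and $\varepsilon$ independent, then with $\operatorname{Var} H = 16$ and $\operatorname{Var} W = 9$,

$$\operatorname{Var} H = 16 = \alpha^2\operatorname{Var} W + \operatorname{Var}\varepsilon,$$

$$\operatorname{Var}(H - W) = 4 = (\alpha - 1)^2\operatorname{Var} W + \operatorname{Var}\varepsilon.$$

Hence,

$$12 = \left(\alpha^2 - (\alpha - 1)^2\right)\operatorname{Var} W = 9(2\alpha - 1),$$

and $\alpha = 7/6$.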
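
The arithmetic behind $\alpha = 7/6$ can be verified with exact fractions. This sketch also reports the implied correlation between $H$ and $W$, which is an additional quantity computed here for illustration and is not part of the original solution.

```python
from fractions import Fraction

var_h, var_w, var_d = 16, 9, 4   # variances of H, W and D = H - W (sd 4, 3 and 2)

# Var(H - W) = Var H + Var W - 2 Cov(H, W)
cov_hw = Fraction(var_h + var_w - var_d, 2)

# In the model H = alpha*W + eps with W, eps independent, Cov(H, W) = alpha * Var W.
alpha = cov_hw / var_w
var_eps = var_h - alpha ** 2 * var_w

# Implied correlation coefficient between H and W.
correlation = float(cov_hw) / (var_h * var_w) ** 0.5
print(alpha, var_eps, correlation)
```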


It’s simple: consider these standard deviations normal. 
(From the series ‘Thus spoke Supertutor’. ) 


Tripos examination questions in IB Statistics 399 


Question 5.25 Suppose that $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ form two independent samples, the first from an exponential distribution with the parameter $\lambda$, and the second from an exponential distribution with parameter $\mu$.

(i) Construct the likelihood ratio test of $H_0: \lambda = \mu$ against $H_1: \lambda \ne \mu$.
(ii) Show that the test in part (i) can be based on the statistic

$$T = \frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n X_i + \sum_{j=1}^m Y_j}.$$

(iii) Describe how the percentiles of the distribution of $T$ under $H_0$ may be determined from the percentiles of an F-distribution.


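Solution (i) For $X_i \sim \mathrm{Exp}(\lambda)$, $Y_j \sim \mathrm{Exp}(\mu)$,

$$f(\mathbf{x}; \lambda) = \prod_{i=1}^n \left(\lambda e^{-\lambda x_i}\right) I(\min x_i \ge 0); \quad \text{maximised at } \hat\lambda^{-1} = \frac{1}{n}\sum_{i=1}^n x_i,$$

with

$$\max\left[f(\mathbf{x}; \lambda): \lambda > 0\right] = \hat\lambda^n e^{-n},$$

and

$$f(\mathbf{y}; \mu) = \prod_{j=1}^m \left(\mu e^{-\mu y_j}\right) I(\min y_j \ge 0); \quad \text{maximised at } \hat\mu^{-1} = \frac{1}{m}\sum_{j=1}^m y_j,$$

with

$$\max\left[f(\mathbf{y}; \mu): \mu > 0\right] = \hat\mu^m e^{-m}.$$

Under $H_0$, the likelihood is

$$f(\mathbf{x}; \theta)f(\mathbf{y}; \theta); \quad \text{maximised at } \hat\theta^{-1} = \frac{1}{n+m}\left(\sum_{i=1}^n x_i + \sum_{j=1}^m y_j\right),$$

with

$$\max\left[f(\mathbf{x}; \theta)f(\mathbf{y}; \theta): \theta > 0\right] = \hat\theta^{n+m} e^{-(n+m)}.$$

Then the test is: reject $H_0$ if the ratio

$$\frac{\hat\lambda^n\hat\mu^m}{\hat\theta^{n+m}} = \left(\frac{\sum_i x_i + \sum_j y_j}{n+m}\right)^{n+m}\left(\frac{n}{\sum_i x_i}\right)^n\left(\frac{m}{\sum_j y_j}\right)^m$$

is large.
(ii) The logarithm has the form

$$(n+m)\ln\left(\sum_i x_i + \sum_j y_j\right) - n\ln\sum_i x_i - m\ln\sum_j y_j$$

plus terms not depending on $\mathbf{x}$, $\mathbf{y}$. The essential part is

$$-n\ln\frac{\sum_i x_i}{\sum_i x_i + \sum_j y_j} - m\ln\frac{\sum_j y_j}{\sum_i x_i + \sum_j y_j} = -n\ln T - m\ln(1 - T).$$

Thus the test is indeed based on $T$. Furthermore, $H_0$ is rejected when $T$ is close to 0 or 1.

(iii) Under $H_0$, $\lambda = \mu = \theta$, and

$$2\theta\sum_i X_i \sim \chi^2_{2n}, \quad 2\theta\sum_j Y_j \sim \chi^2_{2m}.$$

Hence,

$$T^{-1} = 1 + \frac{m}{n}R, \quad R = \frac{\sum_j Y_j/(2m)}{\sum_i X_i/(2n)} \sim F_{2m,2n}.$$

Thus the (equal-tailed) critical region is the union

$$C = \left(0, \left(1 + \frac{m}{n}\varphi^+_{2m,2n}(\alpha/2)\right)^{-1}\right) \cup \left(\left(1 + \frac{m}{n}\varphi^-_{2m,2n}(\alpha/2)\right)^{-1}, 1\right)$$

and it is determined by percentiles of $F_{2m,2n}$.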
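
The reduction in part (ii) can be checked numerically: the log of the LR equals $-n\ln T - m\ln(1-T)$ plus a constant depending only on $n$ and $m$. The sketch below uses made-up samples, and the function name `log_lr_and_t` is a hypothetical helper.

```python
from math import log


def log_lr_and_t(xs, ys):
    """Log of the likelihood ratio and the statistic T for two exponential samples."""
    n, m = len(xs), len(ys)
    sx, sy = sum(xs), sum(ys)
    log_lr = (n + m) * log((sx + sy) / (n + m)) + n * log(n / sx) + m * log(m / sy)
    t = sx / (sx + sy)
    return log_lr, t


# Hypothetical samples, for illustration only.
xs = [0.5, 1.2, 0.3, 2.0]
ys = [3.1, 0.9, 4.4]
log_lr, t = log_lr_and_t(xs, ys)

# The same quantity written through T; the constant depends on n and m only.
n, m = len(xs), len(ys)
const = n * log(n) + m * log(m) - (n + m) * log(n + m)
via_t = -n * log(t) - m * log(1 - t) + const
print(log_lr, via_t, t)
```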


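Question 5.26 Explain what is meant by a sufficient statistic.

Consider the independent RVs $X_1, X_2, \ldots, X_n$, where $X_i \sim N(\alpha + \beta c_i, \theta)$ for given constants $c_i$, $i = 1, 2, \ldots, n$, and unknown parameters $\alpha$, $\beta$ and $\theta$. Find three sample quantities that together constitute a sufficient statistic.


Solution A statistic $T = T(\mathbf{x})$ is sufficient for a parameter $\theta$ if $f(\mathbf{x}; \theta) = g(T, \theta)h(\mathbf{x})$.
For a data vector $\mathbf{x}$ with entries $x_1, \ldots, x_n$, we have

$$f_\theta(\mathbf{x}) = \prod_{i=1}^n f_\theta(x_i) = \frac{1}{(2\pi\theta)^{n/2}}\exp\left[-\frac{1}{2\theta}\sum_i (x_i - \alpha - \beta c_i)^2\right]$$

$$= \frac{1}{(2\pi\theta)^{n/2}}\exp\left\{-\frac{1}{2\theta}\left[\sum_i x_i^2 - 2\sum_i x_i(\alpha + \beta c_i) + \sum_i (\alpha + \beta c_i)^2\right]\right\}$$

$$= \frac{1}{(2\pi\theta)^{n/2}}\exp\left\{-\frac{1}{2\theta}\left[\sum_i x_i^2 - 2\alpha\sum_i x_i - 2\beta\sum_i c_i x_i + \sum_i (\alpha + \beta c_i)^2\right]\right\}.$$

Thus, the triple

$$T(\mathbf{x}) = \left(\sum_i x_i^2,\ \sum_i x_i,\ \sum_i c_i x_i\right)$$

is a sufficient statistic.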


Question 5.27 Let $X_1, X_2, \ldots, X_n$ be a random sample from the $N(\theta, \sigma^2)$-distribution, and suppose that the prior distribution for $\theta$ is the $N(\mu, \tau^2)$-distribution, where $\sigma^2$, $\mu$ and $\tau^2$ are known. Determine the posterior distribution for $\theta$, given $X_1, X_2, \ldots, X_n$, and the optimal estimator of $\theta$ under (i) quadratic loss, and (ii) absolute error loss.


Solution For the first half, see Worked Examples 2.3.8 and 2.3.10. For the second half: as was shown in Worked Example 2.3.8, the posterior distribution is

$$\pi(\theta \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\tau_n^2}}\exp\left[-\frac{(\theta - \mu_n)^2}{2\tau_n^2}\right],$$

where

$$\frac{1}{\tau_n^2} = \frac{1}{\tau^2} + \frac{n}{\sigma^2}, \quad \mu_n = \frac{\mu/\tau^2 + n\bar{x}/\sigma^2}{1/\tau^2 + n/\sigma^2}.$$

As the normal distribution is symmetric about its mean, the best estimators in both cases are $E(\theta \mid \mathbf{x}) = \mu_n$.
Question 5.28 Let $X_1, X_2, \ldots, X_n$ form a random sample from a uniform distribution on the interval $(-\theta, 2\theta)$, where the value of the positive parameter $\theta$ is unknown. Determine the maximum likelihood estimator of the parameter $\theta$.


Solution The likelihood $f(\mathbf{x}; \theta)$ is

$$\frac{1}{(3\theta)^n}I(-\theta < x_1, \ldots, x_n < 2\theta) = \frac{1}{(3\theta)^n}I(-\theta < \min x_i)\,I(\max x_i < 2\theta).$$

It is decreasing in $\theta$ on the set where both indicators equal 1. Hence, the MLE is of the form

$$\hat\theta = \max\left[-\min x_i,\ \frac{1}{2}\max x_i\right].$$
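
The estimator is simple enough to state in one line of code. This is a minimal sketch; the function name `uniform_mle` and the sample values are assumptions for illustration.

```python
def uniform_mle(xs):
    """MLE of theta for a sample from the uniform distribution on (-theta, 2*theta)."""
    # The likelihood (3*theta)^(-n) decreases in theta, so take the smallest theta
    # satisfying both constraints -theta <= min(xs) and max(xs) <= 2*theta.
    return max(-min(xs), max(xs) / 2)


print(uniform_mle([-0.5, 1.0, 3.0]))   # upper constraint binds: max/2
print(uniform_mle([-2.0, 1.0]))        # lower constraint binds: -min
```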


Question 5.29 The $\chi^2$-statistic is often used as a measure of the discrepancy between observed frequencies and expected frequencies under a null hypothesis. Describe the $\chi^2$-statistic, and the $\chi^2$ test for goodness of fit.

The number of directory enquiry calls arriving each day at a centre is 
counted over a period of K weeks. It may be assumed that the number of 
such calls on any day has a Poisson distribution, that the numbers of calls 
on different days are independent, and that the expected number of calls 
depends only on the day of the week. Let $n_i$, $i = 1, 2, \ldots, 7$, denote, respectively, the total number of calls received on a Monday, Tuesday, ..., Sunday.

Derive an approximate test of the hypothesis that calls are received at the 
same rate on all days of the week except Sundays. 

Find also a test of a second hypothesis, that the expected numbers of calls 
received are equal for the three days from Tuesday to Thursday, and that 
the expected numbers of calls received are equal on Monday and Friday. 


Solution Suppose we have possible counts $n_i$ of occurrence of states $i = 1, \ldots, n$, of expected frequencies $e_i$. The $\chi^2$-statistic is given by

$$P = \sum_{i=1}^n \frac{(n_i - e_i)^2}{e_i}.$$

The $\chi^2$ test for goodness of fit is for $H_0: p_i = p_i(\theta)$, $\theta \in \Theta$, against $H_1: p_i$ unrestricted, where $p_i$ is the probability of occurrence of state $i$.

In the example, we assume that the fraction of calls arriving during all days except Sundays is fixed (and calculated from the data). Such an assumption is natural when the data array is massive. However, the fractions of calls within a given day from Monday to Saturday fluctuate, and we proceed as follows. Let $e^* = \frac{1}{6}(n_1 + \cdots + n_6)$, $e_1^* = \frac{1}{3}(n_2 + n_3 + n_4)$, $e_2^* = \frac{1}{2}(n_1 + n_5)$.

Under $H_0$: on Monday-Saturday calls are received at the same rate, the statistic is

$$\sum_{i=1}^6 \frac{(n_i - e^*)^2}{e^*}$$

and has an approximately $\chi^2_5$ distribution. (Here the number of degrees of freedom is five, since one parameter is fitted.)
In the second version, $H_0$ is that on Tuesday, Wednesday and Thursday calls are received at one rate and on Monday and Friday at another rate. Here, the statistic is

$$\sum_{i=2}^4 \frac{(n_i - e_1^*)^2}{e_1^*} + \sum_{i=1,5} \frac{(n_i - e_2^*)^2}{e_2^*}$$

and has an approximately $\chi^2_3$ distribution.
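
Both statistics can be computed from the seven daily totals. The following sketch assumes the totals are given in the order Monday to Sunday; the counts and the function names are invented for illustration.

```python
def pearson(observed, expected):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def equal_rate_mon_to_sat(n):
    """First hypothesis: one common rate on Monday..Saturday (compare with chi^2_5)."""
    e_star = sum(n[:6]) / 6
    return pearson(n[:6], [e_star] * 6)


def two_rate_statistic(n):
    """Second hypothesis: one rate on Tue-Thu, another on Mon and Fri (compare with chi^2_3)."""
    e1 = (n[1] + n[2] + n[3]) / 3
    e2 = (n[0] + n[4]) / 2
    return pearson(n[1:4], [e1] * 3) + pearson([n[0], n[4]], [e2] * 2)


# Hypothetical totals for Monday, ..., Sunday.
calls = [120, 100, 110, 90, 80, 100, 30]
stat_six_days = equal_rate_mon_to_sat(calls)
stat_two_rates = two_rate_statistic(calls)
print(stat_six_days, stat_two_rates)
```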


Question 5.30 (i) Aerial observations $x_1, x_2, x_3, x_4$ are made of the interior angles $\theta_1, \theta_2, \theta_3, \theta_4$ of a quadrilateral on the ground. If these observations are subject to small independent errors with zero means and common variance $\sigma^2$, determine the least squares estimator of $\theta_1, \theta_2, \theta_3, \theta_4$.

(ii) Obtain an unbiased estimator of $\sigma^2$ in the situation described in part (i).

Suppose now that the quadrilateral is known to be a parallelogram with $\theta_1 = \theta_3$ and $\theta_2 = \theta_4$. What now are the least squares estimates of its angles? Obtain an unbiased estimator of $\sigma^2$ in this case.


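Solution (i) The LSEs should minimise $\sum_{i=1}^4 (\theta_i - x_i)^2$, subject to $\sum_{i=1}^4 \theta_i = 2\pi$. The Lagrangian

$$L = \sum_i (\theta_i - x_i)^2 - \lambda\left(\sum_i \theta_i - 2\pi\right)$$

has

$$\frac{\partial L}{\partial\theta_i} = 0 \quad \text{when } 2(\theta_i - x_i) - \lambda = 0, \quad \text{i.e. } \hat\theta_i = x_i + \frac{\lambda}{2}.$$

Adjusting $\sum_i \hat\theta_i = 2\pi$ yields $\lambda = \frac{1}{2}\left(2\pi - \sum_i x_i\right)$, and

$$\hat\theta_i = x_i + \frac{1}{4}\left(2\pi - \sum_i x_i\right).$$

(ii) From the least squares theory, $X_i$, $i = 1, 2, 3, 4$, has mean $\theta_i$, variance $\sigma^2$, and the $X_i$ are independent. Write

$$E\left(X_i - \hat\theta_i\right)^2 = \frac{1}{16}E\left(\sum_j X_j - 2\pi\right)^2 = \frac{1}{16}\operatorname{Var}\left(\sum_j X_j\right) = \frac{\sigma^2}{4}.$$

Thus,

$$E\sum_i \left(X_i - \hat\theta_i\right)^2 = \sigma^2,$$

and $\sum_i (X_i - \hat\theta_i)^2$ is an unbiased estimator of $\sigma^2$.

If $\theta_1 = \theta_3$ and $\theta_2 = \theta_4$, the constraint becomes $2(\theta_1 + \theta_2) = 2\pi$, i.e. $\theta_1 + \theta_2 = \pi$. Then the Lagrangian

$$L = (\theta_1 - x_1)^2 + (\theta_2 - x_2)^2 + (\theta_1 - x_3)^2 + (\theta_2 - x_4)^2 - 2\lambda(\theta_1 + \theta_2 - \pi)$$

has

$$\frac{\partial L}{\partial\theta_1} = 2(\theta_1 - x_1) + 2(\theta_1 - x_3) - 2\lambda, \quad \frac{\partial L}{\partial\theta_2} = 2(\theta_2 - x_2) + 2(\theta_2 - x_4) - 2\lambda,$$

and is minimised at

$$\hat\theta_1 = \frac{1}{2}(x_1 + x_3 + \lambda), \quad \hat\theta_2 = \frac{1}{2}(x_2 + x_4 + \lambda).$$

The constraint $\hat\theta_1 + \hat\theta_2 = \pi$ then gives $\lambda = \pi - \sum_i x_i/2$, and

$$\hat\theta_1 = \frac{1}{2}(x_1 + x_3) + \frac{1}{2}\left(\pi - \frac{1}{2}\sum_i x_i\right) = \frac{1}{4}(x_1 + x_3 - x_2 - x_4) + \frac{\pi}{2},$$

and similarly

$$\hat\theta_2 = \frac{1}{4}(x_2 + x_4 - x_1 - x_3) + \frac{\pi}{2}.$$

Now

$$E\left(X_1 - \hat\theta_1\right)^2 = E\left(\frac{3X_1}{4} - \frac{X_3}{4} + \frac{X_2}{4} + \frac{X_4}{4} - \frac{\pi}{2}\right)^2 = \operatorname{Var}\left(\frac{3X_1}{4} - \frac{X_3}{4} + \frac{X_2}{4} + \frac{X_4}{4}\right)$$

$$= \left(\frac{9}{16} + \frac{3}{16}\right)\sigma^2 = \frac{3}{4}\sigma^2.$$

The same holds for $i = 2, 3, 4$. Hence, $\sum_i (X_i - \hat\theta_i)^2/3$ is an unbiased estimator of $\sigma^2$.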
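
The two sets of least squares estimates can be sketched as follows. The observed angles are invented for illustration, and the function names are hypothetical helpers; the residual sums of squares are scaled as in the solution to give the unbiased variance estimates.

```python
from math import pi


def quadrilateral_lse(x):
    """LSE of four interior angles subject to theta_1 + ... + theta_4 = 2*pi."""
    correction = (2 * pi - sum(x)) / 4      # the same correction for every angle
    return [xi + correction for xi in x]


def parallelogram_lse(x):
    """LSE when theta_1 = theta_3, theta_2 = theta_4 and theta_1 + theta_2 = pi."""
    x1, x2, x3, x4 = x
    t1 = (x1 + x3 - x2 - x4) / 4 + pi / 2
    t2 = (x2 + x4 - x1 - x3) / 4 + pi / 2
    return [t1, t2, t1, t2]


# Hypothetical noisy observations (radians).
x = [1.60, 1.50, 1.62, 1.55]
quad = quadrilateral_lse(x)
para = parallelogram_lse(x)

# Unbiased estimates of sigma^2: RSS in the general case, RSS/3 for the parallelogram.
sigma2_quad = sum((xi - ti) ** 2 for xi, ti in zip(x, quad))
sigma2_para = sum((xi - ti) ** 2 for xi, ti in zip(x, para)) / 3
print(quad, para, sigma2_quad, sigma2_para)
```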


Question 5.31 (i) $X_1, X_2, \ldots, X_n$ form a random sample from a distribution whose PDF is

$$f(x; \theta) = \begin{cases} 2x/\theta^2, & 0 \le x \le \theta, \\ 0, & \text{otherwise}, \end{cases}$$

where the value of the positive parameter $\theta$ is unknown. Determine the MLE of the median of the distribution.

(ii) There is widespread agreement amongst the managers of the Reliable Motor Company that the number $x$ of faulty cars produced in a month has a binomial distribution

$$P(x = s) = \binom{n}{s}p^s(1 - p)^{n-s}, \quad s = 0, 1, \ldots, n; \quad 0 \le p \le 1.$$

There is, however, some dispute about the parameter $p$. The general manager has a prior distribution for $p$ which is uniform (i.e. with the PDF $f_p(x) = I(0 < x < 1)$), while the more pessimistic production manager has a prior distribution with density $f_p(x) = 2xI(0 < x < 1)$. Both PDFs are concentrated on $(0, 1)$.

In a particular month, $s$ faulty cars are produced. Show that if the general manager's loss function is $(\hat{p} - p)^2$, where $\hat{p}$ is her estimate and $p$ is the true value, then her best estimate of $p$ is

$$\hat{p} = \frac{s+1}{n+2}.$$

The production manager has responsibilities different from those of the general manager, and a different loss function given by $(1 - p)(\hat{p} - p)^2$. Find his best estimator of $p$ and show that it is greater than that of the general manager unless $s \ge n/2$.




You may assume that, for non-negative integers $\alpha$, $\beta$,

$$\int_0^1 p^\alpha(1 - p)^\beta\,dp = \frac{\alpha!\,\beta!}{(\alpha + \beta + 1)!}.$$


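Solution (i) If $m$ is the median, the equation

$$\int_0^m \frac{2x}{\theta^2}\,dx = \frac{m^2}{\theta^2} = \frac{1}{2}$$

gives $m = \theta/\sqrt{2}$. Then

$$f(x; m\sqrt{2}) = \frac{2x}{2m^2} = \frac{x}{m^2}, \quad 0 \le x \le m\sqrt{2},$$

is maximised in $m$ by $x/\sqrt{2}$, and $f(\mathbf{x}; m\sqrt{2}) = \prod_i f(x_i; m\sqrt{2})$ by

$$\hat{m} = \frac{1}{\sqrt{2}}\max x_i.$$

(ii) As $P_p(X = s) \propto p^s(1 - p)^{n-s}$, $s = 0, 1, \ldots, n$, the posterior for the general manager (GM) is

$$\pi^{\mathrm{GM}}(p \mid s) \propto p^s(1 - p)^{n-s}I(0 < p < 1),$$

and for the production manager (PM)

$$\pi^{\mathrm{PM}}(p \mid s) \propto p \cdot p^s(1 - p)^{n-s}I(0 < p < 1).$$

Then the expected loss for the GM is minimised at the posterior mean:

$$\hat{p}^{\mathrm{GM}} = \int_0^1 p \cdot p^s(1 - p)^{n-s}\,dp \bigg/ \int_0^1 p^s(1 - p)^{n-s}\,dp = \frac{(s+1)!\,(n-s)!}{(n-s+s+2)!} \cdot \frac{(n-s+s+1)!}{s!\,(n-s)!} = \frac{s+1}{n+2}.$$

For the PM, the expected loss

$$\int_0^1 (1 - p)(p - a)^2\pi^{\mathrm{PM}}(p \mid s)\,dp$$

is minimised at

$$a = \int_0^1 p(1 - p)\pi^{\mathrm{PM}}(p \mid s)\,dp \bigg/ \int_0^1 (1 - p)\pi^{\mathrm{PM}}(p \mid s)\,dp,$$

which yields

$$\hat{p}^{\mathrm{PM}} = \int_0^1 p \cdot p^{s+1}(1 - p)^{n-s+1}\,dp \bigg/ \int_0^1 p^{s+1}(1 - p)^{n-s+1}\,dp$$

$$= \frac{(s+2)!\,(n-s+1)!}{(n-s+s+4)!} \cdot \frac{(n-s+s+3)!}{(s+1)!\,(n-s+1)!} = \frac{s+2}{n+4}.$$

We see that $(s+2)/(n+4) > (s+1)/(n+2)$ iff $s < n/2$.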
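
Both estimates follow from ratios of the integral formula quoted in the question, so they can be verified exactly. The sketch below uses Python's `fractions` module; the function names are hypothetical and $n = 10$ is an arbitrary choice.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Integral of p^a (1-p)^b over (0, 1) for non-negative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def gm_estimate(s, n):
    """General manager: posterior mean under the uniform prior."""
    return beta_integral(s + 1, n - s) / beta_integral(s, n - s)


def pm_estimate(s, n):
    """Production manager: prior 2p, loss (1 - p)(p_hat - p)^2."""
    return beta_integral(s + 2, n - s + 1) / beta_integral(s + 1, n - s + 1)


n = 10
for s in range(n + 1):
    print(s, gm_estimate(s, n), pm_estimate(s, n))
```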


Question 5.32 (i) What is a simple hypothesis? Define the terms size and 
power for a test of one simple hypothesis against another. 

State and prove the NP Lemma. 

(ii) There is a single observation of an RV $X$ which has a PDF $f(x)$. Construct the best test of size 0.05 for the null hypothesis

$$H_0: f(x) = \frac{1}{2} \quad (-1 < x < 1)$$

against the alternative hypothesis

$$H_1: f(x) = \frac{3}{4}(1 - x^2) \quad (-1 < x < 1).$$


Calculate the power of your test. 


Solution (i) A simple hypothesis for a parameter $\theta$ is $H_0: \theta = \theta_0$. The size equals the probability of rejecting $H_0$ when it is true. The power equals the probability of rejecting $H_0$ when it is not true; for the simple alternative $H_1: \theta = \theta_1$ it is a number (in general, a function of the parameter varying within $H_1$).

The statement of the NP Lemma for $H_0: f = f_0$ against $H_1: f = f_1$ is as follows.

Among all tests of size $\le \alpha$, the test with maximum power is given by $C = \{x: f_1(x) > kf_0(x)\}$, for $k$ such that $P(x \in C \mid H_0) = \int_C f_0(x)\,dx = \alpha$.

In other words, for all $k > 0$, the test:

reject $H_0$ when $f_1(x) > kf_0(x)$

has the maximum power among the tests of size $\le \alpha := P_0(f_1(X) > kf_0(X))$.
The proof of the NP Lemma is given in Section 4.2.
(ii) The LR $f_1(x)/f_0(x) = 3(1 - x^2)/2$; we reject $H_0$ when it is $> k$, i.e. $|x| < (1 - 2k/3)^{1/2}$. We want

$$0.05 = P\left(|x| < \left(1 - \frac{2k}{3}\right)^{1/2} \,\Big|\, H_0\right) = \left(1 - \frac{2k}{3}\right)^{1/2}.$$

That is, the condition $|x| < (1 - 2k/3)^{1/2}$ is the same as $|x| < 0.05$. By the NP Lemma, the test

reject $H_0$ when $|x| < 0.05$

is the most powerful of size 0.05.
The power is then

$$P(|x| < 0.05 \mid H_1) = \frac{3}{4}\int_{-0.05}^{0.05}(1 - x^2)\,dx = \frac{3}{2}\left(x - \frac{x^3}{3}\right)\bigg|_0^{0.05} \approx 0.075.$$
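
The size and power of the test in part (ii) can be evaluated in closed form. The following is a small sketch of that arithmetic; the function names are hypothetical helpers.

```python
def size_of_test(c):
    """P(|X| < c | H0) for the uniform density 1/2 on (-1, 1)."""
    return c


def power_of_test(c):
    """P(|X| < c | H1) for the density 3(1 - x^2)/4 on (-1, 1)."""
    return 1.5 * (c - c ** 3 / 3)


size = size_of_test(0.05)
power = power_of_test(0.05)
print(size, power)
```

The power is only about 1.5 times the size because the two densities are close near the origin, where the critical region lies.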


Question 5.33 (i) Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent random samples, respectively, from the $N(\mu_1, \sigma^2)$- and the $N(\mu_2, \sigma^2)$-distributions. Here the parameters $\mu_1$, $\mu_2$ and $\sigma^2$ are all unknown. Explain carefully how you would test the hypothesis $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \ne \mu_2$.

(ii) Let $X_1, \ldots, X_n$ be a random sample from the distribution with the PDF

$$f(x; \theta) = e^{-(x - \theta)}, \quad \text{for } \theta < x < \infty,$$

where $\theta$ has a prior distribution the standard normal $N(0, 1)$. Determine the posterior distribution of $\theta$.

Suppose that $\theta$ is to be estimated when the loss function is the absolute error loss, $L(a, \theta) = |a - \theta|$. Determine the optimal Bayes estimator and express it in terms of the function $c_n(x)$ defined by

$$2\Phi(c_n(x) - n) = \Phi(x - n), \quad \text{for } -\infty < x < \infty,$$

where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-y^2/2}\,dy$$

is the standard normal distribution function.
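Solution (i) With

$$\bar{X} = \frac{1}{m}\sum_i X_i \sim N(\mu_1, \sigma^2/m) \quad \text{and} \quad \bar{Y} = \frac{1}{n}\sum_j Y_j \sim N(\mu_2, \sigma^2/n),$$

we have that under $H_0$

$$\frac{1}{\sigma}\left(\bar{X} - \bar{Y}\right)\left(\frac{1}{m} + \frac{1}{n}\right)^{-1/2} \sim N(0, 1).$$

Set

$$S_{XX} = \sum_{i=1}^m (X_i - \bar{X})^2 \sim \sigma^2\chi^2_{m-1}, \quad S_{YY} = \sum_{j=1}^n (Y_j - \bar{Y})^2 \sim \sigma^2\chi^2_{n-1}.$$

Then

$$\frac{1}{\sigma^2}(S_{XX} + S_{YY}) \sim \chi^2_{m+n-2},$$

and

$$t = \left(\bar{X} - \bar{Y}\right)\left(\frac{1}{m} + \frac{1}{n}\right)^{-1/2} \bigg/ \left[(S_{XX} + S_{YY})/(m + n - 2)\right]^{1/2} \sim t_{m+n-2}.$$

The MP test of size $\alpha$ rejects $H_0$ when $|t|$ exceeds $t_{m+n-2}(\alpha/2)$, the upper $\alpha/2$ point of the $t_{m+n-2}$-distribution.
(ii) By the Bayes formula,

$$\pi(\theta \mid \mathbf{x}) \propto \pi(\theta)f(\mathbf{x}; \theta) \propto e^{-\theta^2/2 + n\theta - \sum_i x_i}\,I(\theta < \min x_i),$$

with the constant of proportionality

$$\left(\int e^{-\theta^2/2 + n\theta - \sum_i x_i}\,I(\theta < \min x_i)\,d\theta\right)^{-1} = \frac{1}{\sqrt{2\pi}\,\exp\left(n^2/2 - \sum_i x_i\right)\Phi(\min x_i - n)}.$$

Equivalently, $\pi(\theta \mid \mathbf{x}) \propto e^{-(\theta - n)^2/2}\,I(\theta < \min x_i)$, i.e. the posterior is the $N(n, 1)$ distribution truncated to $(-\infty, \min x_i)$.
Under absolute error LF $L(a, \theta) = |a - \theta|$, the optimal Bayes estimator is the posterior median. That is, we want $s$, where

$$\int_{-\infty}^s d\theta\,e^{-(\theta - n)^2/2} = \frac{1}{2}\int_{-\infty}^{\min x_i} d\theta\,e^{-(\theta - n)^2/2},$$

or, equivalently, $2\Phi(s - n) = \Phi(\min x_i - n)$, and $s = c_n(\min x_i)$, as required.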
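
The function $c_n$ has no closed form, but it is easy to evaluate numerically by inverting $\Phi$. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `posterior_median` and the sample are assumptions for illustration.

```python
from statistics import NormalDist


def posterior_median(xs):
    """Bayes estimator under absolute error loss: s solving 2*Phi(s - n) = Phi(min(xs) - n)."""
    n = len(xs)
    std_normal = NormalDist()
    half_mass = std_normal.cdf(min(xs) - n) / 2
    return n + std_normal.inv_cdf(half_mass)


# Hypothetical sample, for illustration only.
xs = [2.5, 3.0, 4.1]
s = posterior_median(xs)
print(s)
```

For samples where $\min x_i - n$ is extremely negative the value `half_mass` may underflow towards 0, in which case `inv_cdf` raises an error; the sketch does not guard against that case.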

Question 5.34 (i) Let $X_1, \ldots, X_n$ be a random sample from the distribution with the PDF

$$f(x; \theta) = \frac{2x}{\theta^2}, \quad \text{for } 0 \le x \le \theta.$$

Determine the MLE $M$ of $\theta$ and show that $\left(M, M/(1 - \gamma)^{1/(2n)}\right)$ is a $100\gamma$ percent CI for $\theta$, where $0 < \gamma < 1$.

(ii) Let $X_1, \ldots, X_n$ be an independent random sample from the uniform distribution on $[0, \theta_1]$ and let $Y_1, \ldots, Y_n$ be an independent random sample from the uniform distribution on $[0, \theta_2]$. Derive the form of the likelihood ratio test of the hypothesis $H_0: \theta_1 = \theta_2$ against $H_1: \theta_1 \ne \theta_2$ and express this test in terms of the statistic

$$T = \frac{\max(M_X, M_Y)}{\min(M_X, M_Y)},$$

where $M_X = \max_{1 \le i \le n} X_i$ and $M_Y = \max_{1 \le i \le n} Y_i$.

By observing that under the hypothesis $H_0$ the distribution of $T$ is independent of $\theta = \theta_1 = \theta_2$, or otherwise, determine exactly the critical region for the test of size $\alpha$.


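Solution (i) The likelihood

$$f(\mathbf{x}; \theta) = \frac{2^n}{\theta^{2n}}\left(\prod_{i=1}^n x_i\right)I(\max x_i \le \theta)$$

is written as $g(M(\mathbf{x}), \theta)h(\mathbf{x})$, where $M(\mathbf{x}) = \max x_i$. Hence, $M = M(\mathbf{X}) = \max X_i$ is a sufficient statistic. It is also the MLE, with

$$P(M \le u) = (P(X_1 \le u))^n = \left(\frac{u}{\theta}\right)^{2n}, \quad 0 \le u \le \theta.$$

Then

$$P\left(M \le \theta \le \frac{M}{(1 - \gamma)^{1/(2n)}}\right) = P\left(\theta(1 - \gamma)^{1/(2n)} \le M \le \theta\right)$$

$$= 1 - P\left(M < \theta(1 - \gamma)^{1/(2n)}\right) = 1 - \frac{\left(\theta(1 - \gamma)^{1/(2n)}\right)^{2n}}{\theta^{2n}} = 1 - (1 - \gamma) = \gamma.$$

Hence, $\left(M, M/(1 - \gamma)^{1/(2n)}\right)$ is a $100\gamma$ percent CI for $\theta$.
(ii) Under $H_0$,

$$f_{\mathbf{X},\mathbf{Y}} = \left(\frac{1}{\theta}\right)^{2n}I(0 \le x_1, \ldots, x_n \le \theta)\,I(0 \le y_1, \ldots, y_n \le \theta)$$

is maximised at the MLE $\hat\theta = \max(M_X, M_Y)$. Under $H_1$,

$$f_{\mathbf{X},\mathbf{Y}} = \left(\frac{1}{\theta_1}\right)^n\left(\frac{1}{\theta_2}\right)^n I(0 \le x_1, \ldots, x_n \le \theta_1)\,I(0 \le y_1, \ldots, y_n \le \theta_2)$$

is maximised at the MLE $(\hat\theta_1, \hat\theta_2) = (M_X, M_Y)$.
Then the ratio $\Lambda_{H_1;H_0}(\mathbf{x}, \mathbf{y})$ is

$$\frac{(1/M_X)^n(1/M_Y)^n}{\left(1/\max(M_X, M_Y)\right)^{2n}} = \frac{\left[\max(M_X, M_Y)\right]^{2n}}{(M_X M_Y)^n} = \left[\frac{\max(M_X, M_Y)}{\min(M_X, M_Y)}\right]^n = T^n.$$

So, we reject $H_0$ if $T(\mathbf{x}, \mathbf{y}) \ge k$ for some $k \ge 1$.
Now, under $H_0$

$$P(M_X \le x) = \left(\frac{x}{\theta}\right)^n, \quad \text{i.e. } f_{M_X}(x) = \frac{nx^{n-1}}{\theta^n}, \quad 0 \le x \le \theta,$$

and similarly for $M_Y$. Then, for $0 < \alpha < 1$ and $k \ge 1$, we want $\alpha$ to be equal to (taking $\theta = 1$, as the distribution of $T$ does not depend on $\theta$)

$$P(T \ge k \mid H_0) = 2n^2\int_0^1\int_0^{x/k} x^{n-1}y^{n-1}\,dy\,dx = 2n\int_0^1 x^{n-1}\frac{x^n}{k^n}\,dx = \frac{1}{k^n}.$$

So, $k = \alpha^{-1/n}$, and the critical region for the size $\alpha$ test is

$$C = \left\{\mathbf{x}, \mathbf{y}: T > \alpha^{-1/n}\right\}.$$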
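
The exact test in part (ii) needs only the two sample maxima. The following is a minimal sketch, with made-up data and a hypothetical function name `uniform_lr_test`.

```python
def uniform_lr_test(xs, ys, alpha):
    """Exact size-alpha LR test of theta_1 = theta_2 for two uniform samples of equal size."""
    n = len(xs)
    assert len(ys) == n, "the derivation assumes equal sample sizes"
    m_x, m_y = max(xs), max(ys)
    t = max(m_x, m_y) / min(m_x, m_y)
    k = alpha ** (-1.0 / n)        # critical value: P(T >= k | H0) = k^(-n)
    return t, k, t > k


# Hypothetical samples, for illustration only.
t_stat, k_crit, reject = uniform_lr_test([0.2, 0.9, 0.5], [1.1, 2.7, 0.3], alpha=0.05)
print(t_stat, k_crit, reject)
```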


Question 5.35 (i) State and prove the NP Lemma. 

(ii) Let $X_1, \ldots, X_n$ be a random sample from the $N(\mu, \sigma^2)$-distribution. Prove that the RVs $\bar{X}$ (the sample mean) and $\sum_{i=1}^n (X_i - \bar{X})^2$ ($(n-1) \times$ the sample variance) are independent and determine their distributions.

Suppose that

$$\begin{matrix} X_{11} & \cdots & X_{1n} \\ X_{21} & \cdots & X_{2n} \\ \vdots & & \vdots \\ X_{m1} & \cdots & X_{mn} \end{matrix}$$

are independent RVs and that $X_{ij}$ has the $N(\mu_i, \sigma^2)$-distribution for $1 \le j \le n$, where $\mu_1, \ldots, \mu_m, \sigma^2$ are unknown constants. With reference to your previous result, explain carefully how you would test the hypothesis $H_0: \mu_1 = \cdots = \mu_m$.
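Solution (Part (ii) only.) We claim:
(a) $\bar{X} = \sum_i X_i/n \sim N(\mu, \sigma^2/n)$;
(b) $\bar{X}$ and $S_{XX} = \sum_i (X_i - \bar{X})^2$ are independent;
(c) $S_{XX}/\sigma^2 = \sum_i (X_i - \bar{X})^2/\sigma^2 \sim \chi^2_{n-1}$.
To prove (a) note that linear combinations of normal RVs are normal, so, owing to independence, $\bar{X} \sim N(\mu, \sigma^2/n)$. Also, $n(\bar{X} - \mu)^2/\sigma^2 \sim \chi^2_1$.
To prove (b) and (c) observe that

$$\sum_i (X_i - \mu)^2 = \sum_i \left[(X_i - \bar{X}) + (\bar{X} - \mu)\right]^2$$

$$= \sum_i \left[(X_i - \bar{X})^2 + 2(X_i - \bar{X})(\bar{X} - \mu) + (\bar{X} - \mu)^2\right]$$

$$= S_{XX} + n(\bar{X} - \mu)^2.$$

Given an orthogonal $n \times n$ matrix $A$, set

$$\mathbf{X} = \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}, \quad \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad \text{where } \mathbf{Y} = A^{\mathrm T}(\mathbf{X} - \mu\mathbf{1}).$$

We want to choose $A$ so that the first entry $Y_1 = \sqrt{n}(\bar{X} - \mu)$. That is, $A$ must be of the form

$$A = \begin{pmatrix} 1/\sqrt{n} & \cdots \\ \vdots & \cdots \\ 1/\sqrt{n} & \cdots \end{pmatrix},$$

where the remaining columns are to be chosen so as to make the matrix orthogonal.
Then $Y_1 = \sqrt{n}(\bar{X} - \mu) \sim N(0, \sigma^2)$, and $Y_1$ is independent of $Y_2, \ldots, Y_n$. Since $\sum_i Y_i^2 = \sum_i (X_i - \mu)^2$, we have

$$\sum_{i=2}^n Y_i^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(\bar{X} - \mu)^2 = S_{XX}.$$

Hence $S_{XX} = \sum_{i=2}^n Y_i^2$, where $Y_2, \ldots, Y_n$ are IID $N(0, \sigma^2)$ RVs. Therefore $S_{XX}/\sigma^2 \sim \chi^2_{n-1}$.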
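Now consider the RVs $X_{ij}$ as specified. To test $H_0: \mu_1 = \cdots = \mu_m = \mu$ against $H_1: \mu_1, \ldots, \mu_m$ unrestricted, we use analysis of variance (one-way ANOVA). Write $N = mn$ and $X_{ij} = \mu_i + \varepsilon_{ij}$, $j = 1, \ldots, n$, $i = 1, \ldots, m$, where $\varepsilon_{ij}$ are IID $N(0, \sigma^2)$.
Apply the GLRT: the LR $\Lambda_{H_1;H_0}((x_{ij}))$ equals

$$\frac{\max_{\mu_1, \ldots, \mu_m, \sigma^2}\,(2\pi\sigma^2)^{-N/2}\exp\left[-\sum_{i,j}(x_{ij} - \mu_i)^2/(2\sigma^2)\right]}{\max_{\mu, \sigma^2}\,(2\pi\sigma^2)^{-N/2}\exp\left[-\sum_{i,j}(x_{ij} - \mu)^2/(2\sigma^2)\right]} = \left(\frac{S_0}{S_1}\right)^{N/2}.$$

Here

$$S_0 = \sum_{i,j}(x_{ij} - \bar{x}_{++})^2, \quad S_1 = \sum_{i,j}(x_{ij} - \bar{x}_{i+})^2,$$

and

$$\bar{x}_{i+} = \frac{1}{n}\sum_j x_{ij}, \text{ the mean within group } i \text{ (and the MLE of } \mu_i \text{ under } H_1\text{)},$$

$$\bar{x}_{++} = \frac{1}{N}\sum_{i,j} x_{ij}, \text{ the overall mean (and the MLE of } \mu \text{ under } H_0\text{)}.$$

The test is of the form

reject $H_0$ when $S_0/S_1$ is large.

Next,

$$S_0 = \sum_{i=1}^m\sum_{j=1}^n (x_{ij} - \bar{x}_{i+} + \bar{x}_{i+} - \bar{x}_{++})^2$$

$$= \sum_{i=1}^m\sum_{j=1}^n \left[(x_{ij} - \bar{x}_{i+})^2 + 2(x_{ij} - \bar{x}_{i+})(\bar{x}_{i+} - \bar{x}_{++}) + (\bar{x}_{i+} - \bar{x}_{++})^2\right]$$

$$= \sum_{i,j}(x_{ij} - \bar{x}_{i+})^2 + n\sum_i (\bar{x}_{i+} - \bar{x}_{++})^2 = S_1 + S_2,$$

where

$$S_2 = n\sum_i (\bar{x}_{i+} - \bar{x}_{++})^2.$$

Thus, $S_0/S_1$ is large when $S_2/S_1$ is large. Here $S_1$ is called the within groups (or within samples) and $S_2$ the between groups (between samples) sum of squares.

Next, whether or not $H_0$ is true,

$$\sum_j (X_{ij} - \bar{X}_{i+})^2 \sim \sigma^2\chi^2_{n-1} \quad \text{for all } i,$$

since $E X_{ij}$ depends only on $i$. Hence,

$$S_1 \sim \sigma^2\chi^2_{N-m}$$

as samples for different $i$ are independent. Also, for all $i$,

$$\sum_j (X_{ij} - \bar{X}_{i+})^2 \text{ is independent of } \bar{X}_{i+}.$$

Therefore, $S_1$ is independent of $S_2$. If $H_0$ is true,

$$S_2 \sim \sigma^2\chi^2_{m-1}.$$

Further, if $H_0$ is not true, $S_2$ has

$$E S_2 = (m - 1)\sigma^2 + n\sum_{i=1}^m (\mu_i - \bar\mu)^2,$$

where $\bar\mu = \sum_i \mu_i/m$.
Moreover, under $H_1$, the value of $S_2$ tends to be inflated. So, if $H_0$ is true, then

$$Q = \frac{S_2/(m - 1)}{S_1/(N - m)} \sim F_{m-1,N-m},$$

while under $H_1$ the value of $Q$ is inflated. Thus, in a size $\alpha$ test we reject $H_0$ when the value of statistic $Q$ is larger than $\varphi^+_{m-1,N-m}(\alpha)$, the upper $\alpha$ quantile of $F_{m-1,N-m}$.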
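
The sums of squares and the statistic $Q$ can be computed directly from their definitions. The sketch below assumes $m$ groups of equal size $n$, as in the question; the data and the function name `one_way_anova` are illustrative assumptions.

```python
def one_way_anova(groups):
    """Return (S0, S1, S2, Q) for m groups of equal size n."""
    m, n = len(groups), len(groups[0])
    big_n = m * n
    group_means = [sum(g) / n for g in groups]
    overall_mean = sum(sum(g) for g in groups) / big_n
    s0 = sum((x - overall_mean) ** 2 for g in groups for x in g)              # total
    s1 = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)  # within groups
    s2 = n * sum((gm - overall_mean) ** 2 for gm in group_means)              # between groups
    q = (s2 / (m - 1)) / (s1 / (big_n - m))
    return s0, s1, s2, q


# Hypothetical data: three groups of three observations each.
s0, s1, s2, q = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]])
print(s0, s1, s2, q)
```

The value of `q` would then be compared with the upper $\alpha$ quantile of $F_{m-1,N-m}$, which has to be taken from tables or a statistics library.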


The Private Life of C.A.S. Anova. 
(From the series ‘Movies that never made it to the Big Screen’.) 


Question 5.36 (i) At a particular time three high street restaurants are 
observed to have 43, 41 and 30 customers, respectively. Detailing carefully 
the underlying assumptions that you are making, explain how you would 
test the hypothesis that all three restaurants are equally popular against 
the alternative that they are not. 

(ii) Explain the following terms in the context of hypothesis testing:

(a) simple hypothesis;
(b) composite hypothesis;
(c) type I and type II error probabilities;
(d) size of a test; and
(e) power of a test.

Let $X_1, \ldots, X_n$ be a sample from the $N(\mu, 1)$-distribution. Construct the most powerful size $\alpha$ test of the hypothesis $H_0: \mu = \mu_0$ against $H_1: \mu = \mu_1$, where $\mu_1 > \mu_0$.
Find the test that minimises the larger of the two error probabilities. Justify your answer carefully.


Solution (i) In total, 114 customers have been counted. Assuming that each customer chooses one of the three restaurants with probabilities $p_1$, $p_2$, $p_3$, independently, we should work with a multinomial distribution. The null hypothesis is $H_0: p_1 = p_2 = p_3 = 1/3$, with the expected numbers 38. The value of the Pearson $\chi^2$-statistic is

$$P = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{25 + 9 + 64}{38} = \frac{98}{38} = 2.579.$$

Given $\alpha \in (0, 1)$, we compare $P$ with $h_2^+(\alpha)$, the upper $\alpha$ quantile of $\chi^2_2$ (three categories and no fitted parameters give two degrees of freedom). The size $\alpha$ test will reject $H_0$ when $P > h_2^+(\alpha)$.
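
For two degrees of freedom the upper quantile is available in closed form, because $P(\chi^2_2 > h) = e^{-h/2}$. The sketch below uses this fact to carry out the test at the 5 percent level; the level and the function name are choices made here for illustration.

```python
from math import log

observed = [43, 41, 30]
expected = sum(observed) / 3          # 38 customers per restaurant under H0
pearson_p = sum((o - expected) ** 2 / expected for o in observed)


def chi2_2_upper_quantile(alpha):
    """Upper alpha quantile of chi^2 with 2 d.f.: solves exp(-h/2) = alpha."""
    return -2.0 * log(alpha)


threshold = chi2_2_upper_quantile(0.05)
reject_h0 = pearson_p > threshold
print(round(pearson_p, 3), round(threshold, 3), reject_h0)
```

At this level the data do not contradict the hypothesis that the three restaurants are equally popular.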

(ii) (a) A simple hypothesis $H_0$ is that the PDF/PMF $f = f_0$, a completely specified probability distribution that enables explicit numerical probabilities to be calculated.

(b) A composite hypothesis is that $f$ belongs to a set of PDFs/PMFs.

(c) The TIEP is $P(\text{reject } H_0 \mid H_0 \text{ true})$ and the TIIEP is $P(\text{accept } H_0 \mid H_0 \text{ not true})$. For a simple $H_0$, the TIEP is a number, and for a composite $H_0$ it is a function (of an argument running over the parameter set corresponding to $H_0$). The TIIEP has a similar nature.

(d) The size of a test is equal to the maximum of the TIEP taken over the parameter set corresponding to $H_0$. If $H_0$ is simple, it is simply the TIEP.

(e) Similarly, the power is 1 minus the TIIEP. It should be considered as a function on the set of parameters corresponding to $H_1$.

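To construct the MP test, we use the NP Lemma. The LR

$$\frac{f_{\mu_1}(\mathbf{x})}{f_{\mu_0}(\mathbf{x})} = \exp\left\{-\frac{1}{2}\sum_i \left[(x_i - \mu_1)^2 - (x_i - \mu_0)^2\right]\right\} = \exp\left[n\bar{x}(\mu_1 - \mu_0) + \frac{n}{2}\left(\mu_0^2 - \mu_1^2\right)\right]$$

is large when $\bar{x}$ is large. Thus the critical region for the MP test of size $\alpha$ is

$$C = \{\mathbf{x}: \bar{x} > c\},$$

where

$$\alpha = P_{\mu_0}(\bar{X} > c) = 1 - \Phi\left(\sqrt{n}(c - \mu_0)\right),$$

i.e.

$$c = \mu_0 + \frac{1}{\sqrt{n}}z_+(\alpha).$$

[Figure 5.3: the curve $\beta = \beta(\alpha)$ and the line $\beta = \alpha$ on the unit square; they cross at the point where $\alpha = \beta(\alpha)$.]

The test minimising the larger of the error probabilities must again be NP; otherwise there would be a better one. For size $\alpha$, the TIIEP as a function of $\alpha$ is

$$\beta(\alpha) = P_{\mu_1}(\bar{X} < c) = \Phi\left(z_+(\alpha) + \sqrt{n}(\mu_0 - \mu_1)\right),$$

where $z_+(\alpha)$ is the upper $\alpha$ quantile of $N(0, 1)$. Clearly, $\max[\alpha, \beta(\alpha)]$ is minimal when $\alpha = \beta(\alpha)$. See Figure 5.3.
We know that when $\alpha$ increases, $z_+(\alpha)$ decreases. We choose $\alpha$ with

$$\alpha = \Phi\left(z_+(\alpha) + \sqrt{n}(\mu_0 - \mu_1)\right), \quad \text{i.e. } z_+(\alpha) = -\frac{\sqrt{n}}{2}(\mu_0 - \mu_1),$$

where we used $\alpha = \Phi(-z_+(\alpha))$. This yields

$$c = \mu_0 - \frac{1}{2}(\mu_0 - \mu_1) = \frac{\mu_0 + \mu_1}{2}.$$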
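
The minimax property of the midpoint threshold can be illustrated numerically. The sketch below evaluates both error probabilities for a threshold $c$ using `statistics.NormalDist`; the parameter values $\mu_0 = 0$, $\mu_1 = 1$, $n = 9$ and the function name are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def error_probabilities(c, mu0, mu1, n):
    """Type I and type II error probabilities of the test 'reject H0 when mean > c'."""
    std_normal = NormalDist()
    type_1 = 1.0 - std_normal.cdf(sqrt(n) * (c - mu0))
    type_2 = std_normal.cdf(sqrt(n) * (c - mu1))
    return type_1, type_2


# Hypothetical parameter values.
mu0, mu1, n = 0.0, 1.0, 9
c_star = (mu0 + mu1) / 2
alpha_star, beta_star = error_probabilities(c_star, mu0, mu1, n)

# Moving the threshold away from the midpoint increases the larger error probability.
worst_low = max(error_probabilities(0.4, mu0, mu1, n))
worst_high = max(error_probabilities(0.6, mu0, mu1, n))
print(alpha_star, beta_star, worst_low, worst_high)
```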


Question 5.37 (i) Let $X_1, \ldots, X_m$ be a random sample from the $N(\mu_1, \sigma_1^2)$-distribution and let $Y_1, \ldots, Y_n$ be an independent sample from the $N(\mu_2, \sigma_2^2)$-distribution. Here the parameters $\mu_1$, $\mu_2$, $\sigma_1^2$ and $\sigma_2^2$ are unknown. Explain carefully how you would test the hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ against $H_1: \sigma_1^2 \ne \sigma_2^2$.

(ii) Let $Y_1, \ldots, Y_n$ be independent RVs, where $Y_i$ has the $N(\beta x_i, \sigma^2)$-distribution, $i = 1, \ldots, n$. Here $x_1, \ldots, x_n$ are known but $\beta$ and $\sigma^2$ are unknown.

(a) Determine the maximum-likelihood estimators $(\hat\beta, \hat\sigma^2)$ of $(\beta, \sigma^2)$.

(b) Find the distribution of $\hat\beta$.

(c) By showing that $Y_i - \hat\beta x_i$ and $\hat\beta$ are independent, or otherwise, determine the joint distribution of $\hat\beta$ and $\hat\sigma^2$.

(d) Explain carefully how you would test the hypothesis $H_0: \beta = \beta_0$ against $H_1: \beta \ne \beta_0$.


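Solution (i) Set

$$S_{xx} = \sum_{i=1}^m (x_i - \bar{x})^2, \quad \text{with } \frac{1}{\sigma_1^2}S_{XX} \sim \chi^2_{m-1},$$

and

$$S_{yy} = \sum_{j=1}^n (y_j - \bar{y})^2, \quad \text{with } \frac{1}{\sigma_2^2}S_{YY} \sim \chi^2_{n-1}.$$

Moreover, $S_{XX}$ and $S_{YY}$ are independent.
Then under $H_0$,

$$\frac{1}{m-1}S_{XX} \bigg/ \frac{1}{n-1}S_{YY} \sim F_{m-1,n-1},$$

and in a size $\alpha$ test we reject $H_0$ when $[S_{xx}/(m-1)]/[S_{yy}/(n-1)]$ is either $< \varphi^-_{m-1,n-1}(\alpha/2)$ or $> \varphi^+_{m-1,n-1}(\alpha/2)$, where $\varphi^\pm_{m-1,n-1}(\alpha/2)$ is the upper/lower quantile of $F_{m-1,n-1}$.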
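(ii) The likelihood

$$f_{\beta,\sigma^2}(\mathbf{Y}) = \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\sum_i (Y_i - \beta x_i)^2\right].$$

(a) The MLE $(\hat\beta, \hat\sigma^2)$ of $(\beta, \sigma^2)$ is found from

$$\frac{\partial}{\partial\beta}\ln f_{\beta,\sigma^2}(\mathbf{Y}) = \frac{1}{\sigma^2}\sum_i x_i(Y_i - \beta x_i) = 0, \quad \frac{\partial}{\partial\sigma^2}\ln f_{\beta,\sigma^2}(\mathbf{Y}) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i (Y_i - \beta x_i)^2 = 0,$$

and is

$$\hat\beta = \frac{\sum_i x_i Y_i}{\sum_i x_i^2}, \quad \hat\sigma^2 = \frac{1}{n}\sum_i (Y_i - \hat\beta x_i)^2.$$

This gives the global maximum of the likelihood: for any fixed $\sigma^2$, $\hat\beta$ minimises the convex quadratic function $\sum_i (Y_i - \beta x_i)^2$ of $\beta$.
(b) As a linear combination of independent normals, $\hat\beta \sim N\left(\beta, \sigma^2/\sum_i x_i^2\right)$.
(c) As $\hat\beta$ and $Y_i - \hat\beta x_i$ are jointly normal, they are independent iff their covariance vanishes. As $E(Y_i - \hat\beta x_i) = 0$,

$$\operatorname{Cov}(Y_i - \hat\beta x_i, \hat\beta) = E(Y_i - \hat\beta x_i)\hat\beta = E(Y_i\hat\beta) - x_i E(\hat\beta^2),$$

which is zero since

$$E(Y_i\hat\beta) = \frac{x_i}{\sum_k x_k^2}\left(\beta^2\sum_k x_k^2 + \sigma^2\right) = x_i\left(\operatorname{Var}\hat\beta + (E\hat\beta)^2\right) = x_i E(\hat\beta^2).$$

In a similar fashion we can check that the vector $(Y_1 - \hat\beta x_1, \ldots, Y_n - \hat\beta x_n)$ and $\hat\beta$ are jointly normal and independent. Then $\hat\sigma^2$ and $\hat\beta$ are independent.
Next, the sum $\sum_i (Y_i - \beta x_i)^2$ equals

$$\sum_i \left[(Y_i - \hat\beta x_i) + (\hat\beta - \beta)x_i\right]^2 = \sum_i (Y_i - \hat\beta x_i)^2 + \left(\sum_i x_i^2\right)(\hat\beta - \beta)^2.$$

As

$$\frac{1}{\sigma^2}\sum_i (Y_i - \beta x_i)^2 \sim \chi^2_n \quad \text{and} \quad \frac{1}{\sigma^2}\left(\sum_i x_i^2\right)(\hat\beta - \beta)^2 \sim \chi^2_1,$$

we conclude that $\hat\sigma^2 n/\sigma^2 \sim \chi^2_{n-1}$.

(d) Under $H_0$

$$T = \frac{(\hat\beta - \beta_0)\sqrt{\sum_i x_i^2}}{\sqrt{\frac{1}{n-1}\sum_i (Y_i - \hat\beta x_i)^2}} \sim t_{n-1}.$$

Hence, the test is

reject $H_0$ if $|T| > t_{n-1}(\alpha/2)$, the upper $\alpha/2$ point of $t_{n-1}$.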
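
The estimators and the test statistic of part (ii) can be computed as follows. This is a minimal sketch with invented data; the function name `regression_through_origin` is a hypothetical helper, and the comparison with $t_{n-1}(\alpha/2)$ is left to tables or a statistics library.

```python
from math import sqrt


def regression_through_origin(x, y, beta0):
    """MLEs (beta_hat, sigma2_hat) and the t statistic for H0: beta = beta0."""
    n = len(x)
    s_xx = sum(xi * xi for xi in x)
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / s_xx
    rss = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / n                                   # MLE, divides by n
    t_stat = (beta_hat - beta0) * sqrt(s_xx) / sqrt(rss / (n - 1))
    return beta_hat, sigma2_hat, t_stat


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 7.0]
beta_hat, sigma2_hat, t_stat = regression_through_origin(x, y, beta0=2.0)

# The residuals are orthogonal to x, which is the first likelihood equation.
orthogonality = sum(xi * (yi - beta_hat * xi) for xi, yi in zip(x, y))
print(beta_hat, sigma2_hat, t_stat, orthogonality)
```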


Question 5.38 (i) Let $X$ be a random variable with the PDF

$$f(x; \theta) = e^{-(x - \theta)}, \quad \theta < x < \infty,$$

where $\theta$ has a prior distribution the exponential distribution with mean 1. Determine the posterior distribution of $\theta$.
Find the optimal Bayes estimator of $\theta$ based on $X$ under quadratic loss.
(ii) Let $X_1, \ldots, X_n$ be a sample from the PDF

$$f(x; \lambda, \mu) = \begin{cases} \dfrac{1}{\lambda + \mu}e^{-x/\lambda}, & x \ge 0, \\[2mm] \dfrac{1}{\lambda + \mu}e^{x/\mu}, & x < 0, \end{cases}$$

where $\lambda > 0$ and $\mu > 0$ are unknown parameters. Find (simple) sufficient statistics for $(\lambda, \mu)$ and determine the MLEs $(\hat\lambda, \hat\mu)$ of $(\lambda, \mu)$.

Now suppose that $n = 1$. Is $\hat\lambda$ an unbiased estimator of $\lambda$? Justify your answer.


Solution (i) For the posterior PDF, write
\[ \pi(\theta|x) \propto e^{-\theta}e^{-(x-\theta)} I(x > \theta > 0) \propto I(0 < \theta < x). \]
So, the posterior is $U(0,x)$.
Under quadratic loss, the optimal estimator is the posterior mean, i.e. $x/2$.


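(ii) The sample PDF is
\[ f(\mathbf x;\lambda,\mu) = (\lambda+\mu)^{-n}\exp\left(-\frac{S^+}{\lambda} + \frac{S^-}{\mu}\right), \]
where $S^+ = \sum_i x_i I(x_i > 0)$ and $S^- = \sum_i x_i I(x_i < 0)$. Hence $T(\mathbf x) = (S^+, S^-)$
is a sufficient statistic.
To find the MLE $(\hat\lambda,\hat\mu)$, differentiate $\ell(\mathbf x;\lambda,\mu) = \ln f(\mathbf x;\lambda,\mu)$:
\[ \frac{\partial}{\partial\lambda}\ell(\mathbf x;\lambda,\mu) = -\frac{n}{\lambda+\mu} + \frac{S^+}{\lambda^2} = 0, \qquad
   \frac{\partial}{\partial\mu}\ell(\mathbf x;\lambda,\mu) = -\frac{n}{\lambda+\mu} - \frac{S^-}{\mu^2} = 0, \]
whence
\[ \hat\lambda = \frac{1}{n}\left(S^+ + \sqrt{-S^-S^+}\right), \qquad \hat\mu = \frac{1}{n}\left(-S^- + \sqrt{-S^-S^+}\right), \]
which is the only solution. These values maximise $\ell$, i.e. give the MLE for
$(\lambda,\mu)$, as $\ell \to -\infty$ when either $\lambda$ or $\mu$ tends to 0 or $\infty$.

This argument works when both $S^+, -S^- > 0$. If one of them is 0, the
corresponding parameter is estimated by 0. So, the above formula is valid
for all samples $\mathbf x \in \mathbb R^n \setminus \{0\}$.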
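The closed-form MLE for the two-sided exponential density can be checked numerically. The sketch below is only an illustration: the function names and the sample values are assumptions introduced here, not part of the original solution. It evaluates the formulas for the estimators and confirms that both score equations vanish at the result.

```python
import math

def two_sided_exp_mle(xs):
    """MLE (lambda_hat, mu_hat) for the density
    exp(-x/lam)/(lam+mu) on x >= 0 and exp(x/mu)/(lam+mu) on x < 0."""
    n = len(xs)
    s_plus = sum(x for x in xs if x > 0)    # S^+ >= 0
    s_minus = sum(x for x in xs if x < 0)   # S^- <= 0
    root = math.sqrt(-s_minus * s_plus)
    return (s_plus + root) / n, (-s_minus + root) / n

def score(xs, lam, mu):
    """Partial derivatives of the log-likelihood in lam and in mu."""
    n = len(xs)
    s_plus = sum(x for x in xs if x > 0)
    s_minus = sum(x for x in xs if x < 0)
    return (-n / (lam + mu) + s_plus / lam**2,
            -n / (lam + mu) - s_minus / mu**2)

sample = [1.2, -0.4, 0.7, -2.1, 3.0]   # hypothetical data, not from the text
lam_hat, mu_hat = two_sided_exp_mle(sample)
print(lam_hat, mu_hat, score(sample, lam_hat, mu_hat))
```

Both components of the score should be zero up to rounding, which is a quick sanity check on the algebra rather than a proof of optimality.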

For $n = 1$, $\hat\lambda = xI(x > 0)$. As
\[ E\hat\lambda = \frac{1}{\lambda+\mu}\int_0^\infty x e^{-x/\lambda}\,dx = \frac{\lambda^2}{\lambda+\mu} \]
obviously depends on $\mu$, it cannot give $\lambda$. So, $\hat\lambda$ is biased.
An exception is $\mu = 0$. Then $E\hat\lambda = \lambda$, and it is unbiased. In general, $ES^+ =
n\lambda^2/(\lambda+\mu)$. So,
\[ E\hat\lambda = \frac{\lambda^2}{\lambda+\mu} + \frac{1}{n}E\sqrt{-S^-S^+}. \]
The second summand is $\ge 0$ for $n > 1$ unless $\mu = 0$ and $S^- = 0$. Thus, in
the exceptional case $\mu = 0$, $E\hat\lambda = \lambda$ for all $n \ge 1$.


Question 5.39 (i) A sample $x_1, \ldots, x_n$ is taken from a normal distribution
with an unknown mean $\mu$ and a known variance $\sigma^2$. Show how to
construct the most powerful test of a given size $\alpha \in (0,1)$ for a null hypothesis
$H_0$: $\mu = \mu_0$ against an alternative $H_1$: $\mu = \mu_1$ ($\mu_0 \ne \mu_1$).

What is the value of $\alpha$ for which the power of this test is 1/2?

(ii) State and prove the NP Lemma. For the case of simple null and
alternative hypotheses, what sort of test would you propose for minimising
the sum of probabilities of type I and type II errors? Justify your answer.


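Solution (i) Since both hypotheses are simple, the MP test of size $\le \alpha$ is
the LR test with the critical region
\[ C = \left\{\mathbf x : \frac{f(\mathbf x|H_1)}{f(\mathbf x|H_0)} > k\right\}, \]
where $k$ is such that the TIEP $P(C|H_0) = \alpha$. We have
\[ f(\mathbf x|H_0) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x_i-\mu_0)^2/(2\sigma^2)}, \qquad
   f(\mathbf x|H_1) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x_i-\mu_1)^2/(2\sigma^2)}, \]
and
\[ \Lambda_{H_1;H_0}(\mathbf x) = \ln\frac{f(\mathbf x|H_1)}{f(\mathbf x|H_0)} = -\frac{1}{2\sigma^2}\sum_i\left[(x_i-\mu_1)^2 - (x_i-\mu_0)^2\right]
   = \frac{n}{2\sigma^2}\left[2\bar x(\mu_1-\mu_0) - (\mu_1^2-\mu_0^2)\right]. \]
Case 1: $\mu_1 > \mu_0$. Then we reject $H_0$ when $\bar x > k$, where $\alpha = P(\bar X > k|H_0)$.
Under $H_0$, $\bar X \sim N(\mu_0, \sigma^2/n)$, since $X_i \sim N(\mu_0, \sigma^2)$, independently. Then
$\sqrt n(\bar X - \mu_0)/\sigma \sim N(0,1)$. Thus we reject $H_0$ when $\bar x > \mu_0 + \sigma z_+(\alpha)/\sqrt n$
where $z_+(\alpha) = \Phi^{-1}(1-\alpha)$ is the upper $\alpha$ point of $N(0,1)$.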


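The power of the test is
\[ \frac{1}{\sqrt{2\pi n}\,\sigma}\int_{\sigma\sqrt n z_+(\alpha)+n\mu_0}^\infty e^{-(y-n\mu_1)^2/(2\sigma^2 n)}\,dy
   = \frac{1}{\sqrt{2\pi}}\int_{z_+(\alpha)-\sqrt n(\mu_1-\mu_0)/\sigma}^\infty e^{-y^2/2}\,dy. \]
It equals 1/2 when $z_+(\alpha) = (\mu_1-\mu_0)\sqrt n/\sigma$, i.e.
\[ \alpha = \frac{1}{\sqrt{2\pi}}\int_{\sqrt n(\mu_1-\mu_0)/\sigma}^\infty e^{-y^2/2}\,dy. \]
Case 2: $\mu_1 < \mu_0$. Then we reject $H_0$ when $\bar x < \mu_0 - \sigma z_+(\alpha)/\sqrt n$, by a
similar argument. The power equals 1/2 when
\[ \alpha = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\sqrt n(\mu_1-\mu_0)/\sigma} e^{-y^2/2}\,dy. \]
Hence, the power is 1/2 when
\[ \alpha = \frac{1}{\sqrt{2\pi}}\int_{\sqrt n|\mu_0-\mu_1|/\sigma}^\infty e^{-y^2/2}\,dy. \]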
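The value of $\alpha$ at which the power is 1/2 can be evaluated with the standard normal distribution function. The following sketch uses Python's `statistics.NormalDist`; the function names and the numerical inputs are illustrative assumptions, not values from the question. It computes $\alpha = 1 - \Phi(\sqrt n|\mu_1-\mu_0|/\sigma)$ and then confirms that the test of that size has power one half.

```python
from statistics import NormalDist
import math

def alpha_for_half_power(mu0, mu1, sigma, n):
    """Size alpha at which the one-sided MP test has power 1/2."""
    return 1.0 - NormalDist().cdf(math.sqrt(n) * abs(mu1 - mu0) / sigma)

def power(alpha, mu0, mu1, sigma, n):
    """Power of the size-alpha MP test of mu0 against mu1 (sigma known)."""
    z = NormalDist().inv_cdf(1.0 - alpha)            # upper alpha point z_+(alpha)
    shift = math.sqrt(n) * abs(mu1 - mu0) / sigma
    return 1.0 - NormalDist().cdf(z - shift)

# hypothetical inputs chosen only for illustration
a = alpha_for_half_power(mu0=0.0, mu1=0.5, sigma=1.0, n=9)
print(a, power(a, 0.0, 0.5, 1.0, 9))
```

Using the absolute difference $|\mu_1-\mu_0|$ covers both Case 1 and Case 2 in a single function.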


(ii) (Last part only.) Assume the continuous case: $f(\,\cdot\,|H_0)$ and $f(\,\cdot\,|H_1)$
are PDFs. Take the test with the critical region $C = \{x : f(x|H_1) >
f(x|H_0)\}$, with error probabilities $P(C|H_0)$ and $1 - P(C|H_1)$. Then for any
test with a critical region $C^*$, the sum of error probabilities is
\[ P(C^*|H_0) + 1 - P(C^*|H_1) = 1 + \int_{C^*}[f(x|H_0) - f(x|H_1)]\,dx \]
\[ = 1 + \int_{C^*\cap C}[f(x|H_0) - f(x|H_1)]\,dx + \int_{C^*\setminus C}[f(x|H_0) - f(x|H_1)]\,dx \]
\[ \ge 1 + \int_{C^*\cap C}[f(x|H_0) - f(x|H_1)]\,dx, \]
as the integral over $C^*\setminus C$ is $\ge 0$.
Next,
\[ 1 + \int_{C^*\cap C}[f(x|H_0) - f(x|H_1)]\,dx
   = 1 + \int_{C}[f(x|H_0) - f(x|H_1)]\,dx - \int_{C\setminus C^*}[f(x|H_0) - f(x|H_1)]\,dx \]
\[ \ge 1 + \int_{C}[f(x|H_0) - f(x|H_1)]\,dx = P(C|H_0) + 1 - P(C|H_1), \]
as the integral over $C\setminus C^*$ is $\le 0$.


It has to be said that the development of statistics after the Neyman–
Pearson Lemma was marred by long-lasting controversies to which a great
deal of personal animosity was added. The most prominent (and serious in its
consequences) was perhaps a Fisher-Neyman dispute that opened in public 
in 1935 and continued even after Fisher’s death in 1962 (one of Neyman’s 
articles was entitled ‘The silver jubilee of my dispute with Fisher’). 

Two authoritative books on the history of Statistics, [51] and [115], give 
different versions of how it all started and developed. According to [51], 
p. 263, it was Neyman who in 1934-1935 ‘sniped at Fisher in his lectures and 
blew on the unquenched sparks of misunderstanding between the (Fisher’s 
and K. Pearson’s) departments (at University College London) with appar- 
ent, if undeliberate, genius for making mischief’. On the other hand, [115] 
clearly lays the blame on Fisher, supporting it with quotations attributed to 
a number of people, such as J.R. Oppenheimer (the future head of the Los 
Alamos atomic bomb project), who supposedly said of Fisher in 1936: ‘I took 
one look at him and decided I did not want to meet him’ (see [115], p. 144). 
In this situation, it is plausible that F.N. David (1909-1993), a prominent 
British statistician, who knew all parties involved well, was right when she 
said: ‘They [Fisher, Neyman, K. Pearson, E. Pearson] were all jealous of 
one another, afraid that somebody would get ahead’. And on the particular 
issue of the NP Lemma: ‘Gosset didn’t have a jealous bone in his body. He 
asked the question. Egon Pearson to a certain extent phrased the question 
which Gosset had asked in statistical parlance. Neyman solved the problem 
mathematically’ ([115], p. 133). According to [51], p. 451, the NP Lemma 
was ‘originally built in part on Fisher’s work’, but soon ‘diverged from it. 
It came to be very generally accepted and widely taught, especially in the 
United States. It was not welcomed by Fisher ...’. 

David was one of the great-nieces of Florence Nightingale. It is interesting 
to note that David was the first woman Professor of Statistics in the UK, 
while Florence Nightingale was the first woman Fellow of the Royal Statis- 
tical Society. At the end of her career David moved to California but for 
decades maintained her cottage and garden in south-east England. 

It has to be said that the lengthy arguments about the NP Lemma and 
the theory (and practice) that stemmed from it, which have been produced 
in the course of several decades, reduced in essence to the following basic 
question. You observe a sample, $(x_1, \ldots, x_n)$. What can you (reasonably)
say about a (supposedly) random mechanism that is behind it? According 
to a persistent opinion, the Fisherian approach will be conspicuous in the 
future development of statistical sciences (see, e.g. [43]). But even recognised 
leaders of modern statistics do not claim to have a clear view on this issue 
(see a discussion following the main presentation in [43]). However, it should 
not distract us too much from our humble goal. 
Question 5.40 (i) Explain what is meant by constructing a CI for an
unknown parameter $\theta$ from a given sample $x_1, \ldots, x_n$. Let a family of PDFs
$f(x;\theta)$, $-\infty < \theta < \infty$, be given by
\[ f(x;\theta) = \begin{cases} e^{-(x-\theta)}, & x > \theta,\\ 0, & x < \theta. \end{cases} \]
Suppose that $n = 4$ and $x_1 = -1.0$, $x_2 = 1.5$, $x_3 = 0.5$, $x_4 = 1.0$. Construct
a 95 percent CI for $\theta$.

(ii) Let $f(x;\mu,\sigma^2)$ be a family of normal PDFs with an unknown mean $\mu$
and an unknown variance $\sigma^2 > 0$. Explain how to construct a 95 percent CI
for $\mu$ from a sample $x_1, \ldots, x_n$. Justify the claims about the distributions
you use in your construction.


My Left Tail. 
(From the series ‘Movies that never made it to the Big Screen’.) 


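Solution (i) To construct a $100(1-\gamma)$ percent CI, we need to find two
estimators $a = a(\mathbf x)$, $b = b(\mathbf x)$ such that $P(a(\mathbf X) \le \theta \le b(\mathbf X)) \ge 1-\gamma$.
In the example,
\[ f(\mathbf x;\theta) = e^{-\sum_i x_i + n\theta}\, I(x_i > \theta \text{ for all } i). \]
So $\min X_i$ is a sufficient statistic and $\min X_i - \theta \sim \mathrm{Exp}(n)$. Then we can
take $a = b - \delta$ and $b = \min x_i$, where
\[ \int_\delta^\infty n e^{-nx}\,dx = e^{-n\delta} = \gamma. \]
With $\gamma = 0.05$, $n = 4$ and $\min x_i = -1$, we obtain the CI
\[ \left[-1 - \frac{\ln 20}{4},\ -1\right]. \]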
In a different solution, one considers
\[ \sum_{i=1}^n X_i - n\theta \sim \mathrm{Gam}(n,1), \quad\text{or}\quad 2\sum_{i=1}^n (X_i - \theta) \sim \chi^2_{2n}. \]
Hence we can take
\[ a = \Big(\sum_i x_i - \tfrac12 h^+_{2n}(\gamma/2)\Big)\Big/ n, \qquad b = \Big(\sum_i x_i - \tfrac12 h^-_{2n}(\gamma/2)\Big)\Big/ n, \]
where $h^\pm_{2n}(\gamma/2)$ are the upper/lower $\gamma/2$ quantiles of $\chi^2_{2n}$. With $\gamma = 0.05$,
$n = 4$ and $\sum_i x_i = 2$, we obtain
\[ \left[\frac12 - \frac{h^+_8(0.025)}{8},\ \frac12 - \frac{h^-_8(0.025)}{8}\right]. \]
The precise value for the first interval is $[-1.749, -1]$, and that for the
second
\[ \left[\frac12 - \frac{17.530}{8},\ \frac12 - \frac{2.180}{8}\right] = [-1.6912, 0.2275]. \]
The endpoint 0.2275 is of course of no use since we know that $\theta < x_1 = -1.0$.
Replacing 0.2275 by $-1$ we obtain a shorter CI, of the form $[-1.6912, -1]$.
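The first construction, based on the sample minimum, needs nothing more than a logarithm. The snippet below is a minimal sketch of that calculation for the data of the question; the function name is an assumption made for illustration.

```python
import math

def ci_from_minimum(xs, gamma=0.05):
    """100(1-gamma)% CI [min - delta, min] for the shift theta,
    where delta solves exp(-n * delta) = gamma."""
    n = len(xs)
    delta = -math.log(gamma) / n
    b = min(xs)
    return b - delta, b

print(ci_from_minimum([-1.0, 1.5, 0.5, 1.0]))
```

The second construction requires quantiles of $\chi^2_8$, which are not available in the Python standard library; they would have to be taken from tables or from a statistics package.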
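(ii) Define
\[ \hat\mu = \frac1n\sum_{i=1}^n X_i, \qquad \hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \hat\mu)^2. \]
Then (a) $\sqrt n(\hat\mu - \mu)/\sigma \sim N(0,1)$ and (b) $(n-1)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-1}$, independently,
implying that
\[ \sqrt n\,\frac{\hat\mu - \mu}{\hat\sigma} \sim t_{n-1}. \]
Hence, an equal-tailed CI is
\[ \left[\hat\mu - \frac{\hat\sigma}{\sqrt n}\,t_{n-1}(0.025),\ \hat\mu + \frac{\hat\sigma}{\sqrt n}\,t_{n-1}(0.025)\right], \]
where $t_{n-1}(\alpha/2)$ is the upper $\alpha/2$ point of $t_{n-1}$, $\alpha = 0.05$. The justification of
claims (a), (b) has been given above.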


Question 5.41 (i) State and prove the factorisation criterion for sufficient
statistics, in the case of discrete RVs.

(ii) A linear function $y = Ax + B$ with unknown coefficients $A$ and $B$ is
repeatedly measured at distinct points $x_1, \ldots, x_k$: first $n_1$ times at $x_1$, then
$n_2$ times at $x_2$ and so on; and finally $n_k$ times at $x_k$. The result of the $i$th
measurement series is a sample $y_{i1}, \ldots, y_{in_i}$, $i = 1, \ldots, k$. The errors of all
measurements are independent normal RVs, with mean zero and variance 1.
You are asked to estimate $A$ and $B$ from the whole sample $y_{ij}$, $1 \le j \le
n_i$, $1 \le i \le k$. Prove that the maximum likelihood and the least squares
estimators of $(A, B)$ coincide and find these.

Denote by $\hat A$ the maximum likelihood estimator of $A$ and by $\hat B$ the
maximum likelihood estimator of $B$. Find the distribution of $(\hat A, \hat B)$.
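Solution (Part (ii) only.) Define
\[ n = \sum_i n_i, \qquad \bar x = \frac1n\sum_i x_in_i \]
and
\[ u_i = x_i - \bar x, \quad\text{with}\quad \langle un\rangle := \sum_i u_in_i = 0 \ \text{ and } \ \langle u^2n\rangle := \sum_i u_i^2n_i > 0. \]
Let $\alpha = A$, $\beta = B + A\bar x$, then $y_{ij} = \alpha u_i + \beta + \epsilon_{ij}$ and $Y_{ij} \sim N(\alpha u_i + \beta, 1)$, i.e.
\[ f_{Y_{ij}}(y_{ij}) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac12(y_{ij} - \alpha u_i - \beta)^2\right]. \]
To find the MLE, we need to maximise $\prod_{i,j}\exp\left[-\frac12(y_{ij} - \alpha u_i - \beta)^2\right]$. This
is equivalent to minimising the quadratic function $\sum_{i,j}(y_{ij} - \alpha u_i - \beta)^2$.
The last problem gives precisely the least squares estimators. Therefore the
MLEs and the LSEs coincide.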


To find the LSEs, we solve 


with 


That is 


| z il x 
Ba ae Oe a ey) 


Observe that (A, B ) are jointly normal, with covariance 


1 


Cov(A, B) = CR) 


Sy (A?2? + 2ABx;) uj 


aj 


(A?x? + B? +1+2ABz;)u? — AB. 
cr (u2n)2 
1,9 
Question 5.42 (i) Let $x_1, \ldots, x_n$ be a random sample from the PDF
$f(x;\theta)$. What is meant by saying that $t(x_1, \ldots, x_n)$ is sufficient for $\theta$?
Let
\[ f(x;\theta) = \begin{cases} e^{-(x-\theta)}, & x > \theta,\\ 0, & \text{otherwise}, \end{cases} \]
and suppose $n = 3$. Let $y_1 < y_2 < y_3$ be ordered values of $x_1, x_2, x_3$. Show
that $y_1$ is sufficient for $\theta$.

(ii) Show that the distribution of $Y_1 - \theta$ is exponential of parameter 3.
Your client suggests the following possibilities as estimators of $\theta$:
\[ \hat\theta_1 = x_3 - 1, \qquad \hat\theta_2 = y_1, \qquad \hat\theta_3 = \tfrac13(x_1 + x_2 + x_3) - 1. \]
How would you advise him?
Hint: General theorems used should be clearly stated, but need not be
proved.

Solution (i) $T(\mathbf x)$ is sufficient for $\theta$ iff the conditional distribution of $\mathbf X$ given
$T(\mathbf X)$ does not involve $\theta$. That is, $P_\theta(\mathbf X \in B|T = t)$ is independent of $\theta$ for
any domain $B$ in the sample space. The factorisation criterion states that $T$
is sufficient iff the sample PDF $f(\mathbf x;\theta) = g(T(\mathbf x),\theta)h(\mathbf x)$ for some functions
$g$ and $h$. For $f(x;\theta)$ as specified, with $n = 3$, $\mathbf x = (x_1, x_2, x_3)^{\mathrm T}$,
\[ f(\mathbf x;\theta) = \prod_{i=1}^3 e^{-(x_i-\theta)}I(x_i > \theta) = e^{3\theta}e^{-\sum_i x_i}I(\min x_i > \theta). \]
So, $T = \min X_i := Y_1$ is a sufficient statistic: here $g(y,\theta) = e^{3\theta}I(y > \theta)$,
$h(\mathbf x) = e^{-\sum_i x_i}$.
(ii) For all $y > 0$:
\[ P_\theta(Y_1 - \theta > y) = P_\theta(X_1, X_2, X_3 > y + \theta) = \prod_{i=1}^3 P_\theta(X_i > y + \theta) = e^{-3y}. \]
Hence, $Y_1 - \theta \sim \mathrm{Exp}(3)$.
Now, $E(X_3 - 1) = E(X_3 - \theta) + \theta - 1 = \theta$, $\mathrm{Var}\,X_3 = 1$.
Next, $Y_1 = \min X_i$ is the MLE maximising $f(\mathbf x;\theta)$ in $\theta$; it is biased as
\[ EY_1 = E(Y_1 - \theta) + \theta = \frac13 + \theta. \]
The variance $\mathrm{Var}\,\hat\theta_2$ equals 1/9.
Finally,
\[ E\left(\tfrac13(X_1 + X_2 + X_3) - 1\right) = \theta, \qquad \mathrm{Var}\left(\tfrac13(X_1 + X_2 + X_3) - 1\right) = \frac13. \]
We see that $Y_1$ has the least variance of the three. We could advise the client
to use $\hat\theta_2$ bearing in mind the bias. However, the better choice is $\hat\theta_2 - 1/3$.


Remark. The RB Theorem suggests using the estimator $\hat\theta = E(\hat\theta_1|\hat\theta_2 = t)
= E(\hat\theta_3|\hat\theta_2 = t)$. A straightforward computation yields that $\hat\theta = t - 1/3$.
Hence, $E\hat\theta = \theta$ and $\mathrm{Var}\,\hat\theta = 1/9$. That is, the RB procedure does not create
the bias and reduces the variance to the minimum.
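The comparison of the three estimators can also be illustrated by simulation. The sketch below is an assumption-laden illustration: the true value $\theta = 2$, the number of trials and the seed are arbitrary choices made here. It draws samples of size 3 from the shifted exponential law and reports the empirical mean and variance of $\hat\theta_1$, the bias-corrected $\hat\theta_2 - 1/3$, and $\hat\theta_3$.

```python
import random
from statistics import fmean, pvariance

def simulate(theta=2.0, trials=20000, seed=0):
    """Empirical (mean, variance) of the three estimators for samples of size 3."""
    rng = random.Random(seed)
    est = {"theta1": [], "theta2 - 1/3": [], "theta3": []}
    for _ in range(trials):
        x = [theta + rng.expovariate(1.0) for _ in range(3)]
        est["theta1"].append(x[2] - 1.0)                  # x_3 - 1
        est["theta2 - 1/3"].append(min(x) - 1.0 / 3.0)    # bias-corrected minimum
        est["theta3"].append(sum(x) / 3.0 - 1.0)          # sample mean - 1
    return {name: (fmean(v), pvariance(v)) for name, v in est.items()}

results = simulate()
for name, (m, v) in results.items():
    print(f"{name}: mean {m:.3f}, variance {v:.3f}")
```

All three empirical means should be close to the chosen $\theta$, while the variances should be near 1, 1/9 and 1/3 respectively, in line with the calculation above.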


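Question 5.43 (i) Derive the form of the MLEs of $\alpha$, $\beta$ and $\sigma^2$ in the
linear model
\[ Y_i = \alpha + \beta x_i + \epsilon_i, \]
$1 \le i \le n$, where $\epsilon_i \sim N(0,\sigma^2)$ and $\sum_{i=1}^n x_i = 0$.
(ii) What is the joint distribution of the maximum likelihood estimators
$\hat\alpha$, $\hat\beta$ and $\hat\sigma^2$? Construct 95 percent CIs for

(a) $\sigma^2$,
(b) $\alpha + \beta$.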

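Solution (i) We have that $Y_i \sim N(\alpha + \beta x_i, \sigma^2)$. Then
\[ f(\mathbf y;\alpha,\beta,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_i (y_i - \alpha - \beta x_i)^2\right], \]
and
\[ \ell(\mathbf y;\alpha,\beta,\sigma^2) = \ln f(\mathbf y;\alpha,\beta,\sigma^2) = -\frac n2\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (y_i - \alpha - \beta x_i)^2. \]
The maximum is attained at
\[ \frac{\partial}{\partial\alpha}\ell(\mathbf y;\alpha,\beta,\sigma^2) = \frac{\partial}{\partial\beta}\ell(\mathbf y;\alpha,\beta,\sigma^2) = \frac{\partial}{\partial\sigma^2}\ell(\mathbf y;\alpha,\beta,\sigma^2) = 0, \]
i.e. at
\[ \hat\alpha = \bar y, \qquad \hat\beta = \frac{\mathbf x^{\mathrm T}\mathbf Y}{\mathbf x^{\mathrm T}\mathbf x}, \qquad \hat\sigma^2 = \frac1n\sum_i \hat\epsilon_i^2, \quad\text{where } \hat\epsilon_i = Y_i - \hat\alpha - \hat\beta x_i. \]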

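(ii) Then we have
\[ \hat\alpha \sim N\left(\alpha, \frac{\sigma^2}{n}\right), \qquad \hat\beta \sim N\left(\beta, \frac{\sigma^2}{\mathbf x^{\mathrm T}\mathbf x}\right), \qquad \frac{n}{\sigma^2}\,\hat\sigma^2 \sim \chi^2_{n-2}. \]
Also,
\[ \mathrm{Cov}(\hat\alpha, \hat\beta) = \mathrm{Cov}\left(\frac1n\sum_i Y_i,\ \frac{1}{\mathbf x^{\mathrm T}\mathbf x}\sum_i x_iY_i\right) = \frac{\sigma^2}{n\,\mathbf x^{\mathrm T}\mathbf x}\sum_i x_i = 0, \]
\[ \mathrm{Cov}(\hat\alpha, \hat\epsilon_i) = \mathrm{Cov}(\hat\alpha, Y_i - \hat\alpha - \hat\beta x_i) = \frac1n\sum_{j=1}^n\left(\delta_{ij} - \frac1n\right)\mathrm{Var}\,Y_j = 0, \]
and
\[ \mathrm{Cov}(\hat\beta, \hat\epsilon_i) = \mathrm{Cov}(\hat\beta, Y_i - \hat\alpha - \hat\beta x_i) = \frac{1}{\mathbf x^{\mathrm T}\mathbf x}\sum_{j=1}^n x_j\left(\delta_{ij} - \frac{x_ix_j}{\mathbf x^{\mathrm T}\mathbf x}\right)\sigma^2 = 0. \]
So, $\hat\alpha$, $\hat\beta$ and $\hat\epsilon_1, \ldots, \hat\epsilon_n$ are independent. Hence, $\hat\alpha$, $\hat\beta$ and $\hat\sigma^2$ are also
independent. Therefore, (a) the 95 percent CI for $\sigma^2$ is
\[ \left(\frac{n\hat\sigma^2}{h^+},\ \frac{n\hat\sigma^2}{h^-}\right), \]
where $h^-$ is the lower and $h^+$ the upper 0.025 point of $\chi^2_{n-2}$.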


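Finally, (b)
\[ \hat\alpha + \hat\beta \sim N\left(\alpha + \beta,\ \sigma^2\left(\frac1n + \frac{1}{\mathbf x^{\mathrm T}\mathbf x}\right)\right) \]
and is independent of $\hat\sigma^2$. Then
\[ \frac{\hat\alpha + \hat\beta - (\alpha + \beta)}{\hat\sigma\sqrt{\dfrac1n + \dfrac{1}{\mathbf x^{\mathrm T}\mathbf x}}\,\sqrt{\dfrac{n}{n-2}}} \sim t_{n-2}. \]
Hence, the 95 percent CI for $\alpha + \beta$ is
\[ \left[\hat\alpha + \hat\beta - t\,\hat\sigma\sqrt{\frac1n + \frac{1}{\mathbf x^{\mathrm T}\mathbf x}}\sqrt{\frac{n}{n-2}},\ \
   \hat\alpha + \hat\beta + t\,\hat\sigma\sqrt{\frac1n + \frac{1}{\mathbf x^{\mathrm T}\mathbf x}}\sqrt{\frac{n}{n-2}}\right], \]
where $t$ is the upper 0.025 quantile of $t_{n-2}$.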
Question 5.44 (i) Describe briefly a procedure for obtaining a Bayesian
point estimator from a statistical experiment. Include in your description
definitions of the terms:

(a) prior; and
(b) posterior.

(ii) Let $X_1, \ldots, X_n$ be independent, identically distributed (IID) RVs,
each having distribution $\mathrm{Gam}(k, \lambda)$. Suppose $k$ is known, and, a priori, $\lambda$
is exponential of parameter $\mu$. Suppose a penalty of $(a - \lambda)^2$ is incurred on
estimating $\lambda$ by $a$. Calculate the posterior for $\lambda$ and find an optimal Bayes
estimator for $\lambda$.


Solution (i) We are given a sample PDF/PMF $f(\mathbf x;\theta)$ and a prior
PDF/PMF $\pi(\theta)$ for the parameter. We then consider the posterior distribution
\[ \pi(\theta|\mathbf x) \propto \pi(\theta)f(\mathbf x;\theta), \]
normalised to have the total mass 1. The Bayes estimator is defined as the
minimiser of the expected loss $E_{\pi(\theta|\mathbf x)}L(a,\theta)$ for a given function $L(a,\theta)$
specifying the loss incurred when $\theta$ is estimated by $a$.

(ii) For the quadratic loss, where $L(a,\theta) = (a - \theta)^2$, the Bayes estimator is
given by the posterior mean. In the example given, the posterior is Gamma:
\[ \pi(\lambda|\mathbf x) \propto e^{-\mu\lambda}\lambda^{kn}e^{-\lambda\sum_i x_i} \sim \mathrm{Gam}\Big(kn + 1,\ \mu + \sum_i x_i\Big). \]
Hence, the Bayes estimator is given by its mean $(kn + 1)/(\mu + \sum_i x_i)$.
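The conjugate update above amounts to two additions and a division. The sketch below is illustrative only: the function name and the sample values are assumptions, not part of the question. It returns the posterior shape, the posterior rate and the resulting Bayes estimate under quadratic loss.

```python
def gamma_posterior(xs, k, mu):
    """Posterior Gam(shape, rate) for lambda given Gam(k, lambda) data and an
    Exp(mu) prior, together with its mean (the Bayes estimator)."""
    shape = k * len(xs) + 1
    rate = mu + sum(xs)
    return shape, rate, shape / rate

# hypothetical data: three observations, known k = 2, prior parameter mu = 1
print(gamma_posterior([0.5, 1.5, 2.0], k=2, mu=1.0))
```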


Question 5.45 Let $X_1, X_2, \ldots, X_n$ be an independent sample from a normal
distribution with unknown mean $\mu$ and variance $\sigma^2$. Show that the pair
$(\bar X, S^2)$, where
\[ \bar X = \frac1n\sum_{i=1}^n X_i, \qquad S^2 = \frac1n\sum_{i=1}^n (X_i - \bar X)^2, \]
is a sufficient statistic for $(\mu, \sigma^2)$.
Given $\lambda > 0$, consider $\lambda S^2$ as an estimator of $\sigma^2$. For what values of $\lambda$ is
$\lambda S^2$
(i) maximum likelihood, and
(ii) unbiased?

430 Tripos examination questions in IB Statistics 


Which value of $\lambda$ minimises the mean square error
\[ E(\lambda S^2 - \sigma^2)^2? \]


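Solution The likelihood
\[ f(\mathbf x;\mu,\sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right] \]
\[ = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \bar x + \bar x - \mu)^2\right] \]
\[ = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left[-\frac{n}{2\sigma^2}\left(S^2 + (\bar x - \mu)^2\right)\right]. \]
Hence, by the factorisation criterion, $(\bar X, S^2)$ is a sufficient statistic.
Maximising in $\mu$ and $\sigma^2$ is equivalent to solving
\[ \frac{\partial}{\partial\mu}\ln f(\mathbf x;\mu,\sigma^2) = \frac{\partial}{\partial\sigma^2}\ln f(\mathbf x;\mu,\sigma^2) = 0, \]
which yields $\hat\mu = \bar x$, $\hat\sigma^2 = \sum_i (x_i - \bar x)^2/n$. Hence, (i) $\lambda S^2$ is MLE for $\lambda = 1$.
For (ii)
\[ nES^2 = E\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n E(X_i - \mu)^2 - nE(\bar X - \mu)^2
   = n\,\mathrm{Var}\,X - n\cdot\frac1n\,\mathrm{Var}\,X = (n-1)\sigma^2. \]
Hence, $\lambda S^2$ is unbiased for $\lambda = n/(n-1)$.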
Finally, set $\phi(\lambda) = E(\lambda S^2 - \sigma^2)^2$. Differentiating yields
\[ \phi'(\lambda) = 2E(\lambda S^2 - \sigma^2)S^2. \]
Next, $nS^2 \sim (Y_1^2 + \cdots + Y_{n-1}^2)$, where $Y_i \sim N(0,\sigma^2)$, independently. Hence,
\[ E(S^2)^2 = n^{-2}E(Y_1^2 + \cdots + Y_{n-1}^2)^2 = n^{-2}\left[(n-1)EY_1^4 + (n-1)(n-2)(EY_1^2)^2\right] \]
\[ = n^{-2}\left[3(n-1)\sigma^4 + (n-1)(n-2)\sigma^4\right] = n^{-2}(n^2-1)\sigma^4. \]
As $ES^2 = (n-1)\sigma^2/n$, equation $\phi'(\lambda) = 0$ gives
\[ \lambda = \frac{\sigma^2ES^2}{E(S^2)^2} = \frac{n}{n+1}, \]
which is clearly the minimiser.
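The three choices of $\lambda$ can be compared exactly using the two moments derived above. The sketch below works in units where $\sigma^2 = 1$ and uses rational arithmetic, so no rounding is involved; the function name and the choice $n = 10$ are illustrative assumptions.

```python
from fractions import Fraction

def mse(lam, n):
    """E(lam * S^2 - sigma^2)^2 in units of sigma^4, with S^2 = sum (X_i - Xbar)^2 / n."""
    es2 = Fraction(n - 1, n)             # E S^2 / sigma^2
    es4 = Fraction(n * n - 1, n * n)     # E (S^2)^2 / sigma^4
    return lam * lam * es4 - 2 * lam * es2 + 1

n = 10
candidates = {
    "MLE": Fraction(1),
    "unbiased": Fraction(n, n - 1),
    "min MSE": Fraction(n, n + 1),
}
for name, lam in candidates.items():
    print(name, lam, mse(lam, n))
```

Since the mean square error is a convex quadratic in $\lambda$, the value $n/(n+1)$ beats both the MLE and the unbiased choice for every $n \ge 2$.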


The Mystery of Mean Mu and Squared Stigma. 
(From the series ‘Movies that never made it to the Big Screen’.) 


Question 5.46 Suppose you are given a collection of $np$ independent RVs
organised in $n$ samples, each of length $p$:
\[ X^{(1)} = (X_{11}, \ldots, X_{1p}), \quad X^{(2)} = (X_{21}, \ldots, X_{2p}), \quad \ldots, \quad X^{(n)} = (X_{n1}, \ldots, X_{np}). \]
The RV $X_{ij}$ has a Poisson distribution with an unknown parameter $\lambda_j$, $1 \le
j \le p$. You are required to test the hypothesis that $\lambda_1 = \cdots = \lambda_p$ against the
alternative that at least two of the values $\lambda_j$ are distinct. Derive the form
of the likelihood ratio test statistic. Show that it may be approximated by 


with 


Explain how you would test the hypothesis for large n. 


Solution See Example 4.5.6. 


Question 5.47 Let $X_1, X_2, \ldots, X_n$ be an independent sample from a
normal distribution with a known mean $\mu$ and an unknown variance $\sigma^2$
taking one of two values $\sigma_1^2$ and $\sigma_2^2$. Explain carefully how to construct a
most powerful test of size $\alpha$ of the hypothesis $\sigma = \sigma_1$ against the alternative
$\sigma = \sigma_2$. Does there exist an MP test of size $\alpha$ with power strictly less than
$\alpha$? Justify your answer.


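Solution By the NP Lemma, as both hypotheses are simple, the MP test of
size $\le \alpha$ is the likelihood ratio test

reject $H_0$: $\sigma = \sigma_1$ in favour of $H_1$: $\sigma = \sigma_2$,

when
\[ \frac{(2\pi\sigma_2^2)^{-n/2}\exp\left[-\sum_i (x_i - \mu)^2/(2\sigma_2^2)\right]}{(2\pi\sigma_1^2)^{-n/2}\exp\left[-\sum_i (x_i - \mu)^2/(2\sigma_1^2)\right]} > k, \]
where $k > 0$ is adjusted so that the TIEP equals $\alpha$. Such a test always exists
because the normal PDF $f(x)$ is monotone in $|x|$ and continuous.
By rewriting the LR as
\[ \left(\frac{\sigma_1}{\sigma_2}\right)^n\exp\left[\frac12\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right)\sum_i (x_i - \mu)^2\right] \]
we see that if $\sigma_2 > \sigma_1$, we reject $H_0$ in the critical region
\[ C^+ = \Big\{\mathbf x : \sum_i (x_i - \mu)^2 \ge k^+\Big\}, \]
and if $\sigma_1 > \sigma_2$, in the critical region
\[ C^- = \Big\{\mathbf x : \sum_i (x_i - \mu)^2 \le k^-\Big\}. \]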

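Furthermore,
\[ \frac{1}{\sigma_1^2}\sum_i (X_i - \mu)^2 \sim \chi^2_n \ \text{ under } H_0 \]
and
\[ \frac{1}{\sigma_2^2}\sum_i (X_i - \mu)^2 \sim \chi^2_n \ \text{ under } H_1. \]
The $\chi^2_n$ PDF is
\[ f_{\chi^2_n}(x) = \frac{1}{\Gamma(n/2)}\,\frac{1}{2^{n/2}}\,x^{n/2-1}e^{-x/2}\,I(x > 0). \]
Then if $\sigma_2 > \sigma_1$, we choose $k^+$ so that the size
\[ P(C^+|H_0) = \int_{k^+/\sigma_1^2}^\infty \frac{1}{\Gamma(n/2)}\,\frac{1}{2^{n/2}}\,x^{n/2-1}e^{-x/2}\,dx = \alpha, \]
and if $\sigma_1 > \sigma_2$, $k^-$ so that
\[ P(C^-|H_0) = \int_0^{k^-/\sigma_1^2} \frac{1}{\Gamma(n/2)}\,\frac{1}{2^{n/2}}\,x^{n/2-1}e^{-x/2}\,dx = \alpha. \]
The power for $\sigma_2 > \sigma_1$ equals
\[ \beta = P(C^+|H_1) = \int_{k^+/\sigma_2^2}^\infty \frac{1}{\Gamma(n/2)}\,\frac{1}{2^{n/2}}\,x^{n/2-1}e^{-x/2}\,dx \]
and for $\sigma_1 > \sigma_2$,
\[ \beta = P(C^-|H_1) = \int_0^{k^-/\sigma_2^2} \frac{1}{\Gamma(n/2)}\,\frac{1}{2^{n/2}}\,x^{n/2-1}e^{-x/2}\,dx. \]
We see that if $\sigma_2 > \sigma_1$, then $k^+/\sigma_2^2 < k^+/\sigma_1^2$, and $\beta > \alpha$. Similarly, if $\sigma_1 >
\sigma_2$, then $k^-/\sigma_2^2 > k^-/\sigma_1^2$, and again $\beta > \alpha$. Thus, $\beta < \alpha$ is impossible.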
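The inequality $\beta > \alpha$ can be made completely explicit in the special case $n = 2$, where the $\chi^2_2$ tail is $e^{-x/2}$ and all integrals have closed forms. The sketch below covers only this special case, chosen here for illustration because it needs no chi-square tables; the function name and the numerical inputs are assumptions.

```python
def mp_variance_test_power(alpha, sigma1, sigma2):
    """Power of the size-alpha MP test of sigma1 against sigma2 when n = 2,
    using P(chi^2_2 > x) = exp(-x / 2)."""
    ratio = (sigma1 / sigma2) ** 2
    if sigma2 > sigma1:
        # reject for large sums: size exp(-k / (2 sigma1^2)) = alpha,
        # power exp(-k / (2 sigma2^2)) = alpha ** ratio
        return alpha ** ratio
    # reject for small sums: size 1 - exp(-k / (2 sigma1^2)) = alpha,
    # power 1 - (1 - alpha) ** ratio
    return 1.0 - (1.0 - alpha) ** ratio

for s1, s2 in [(1.0, 2.0), (2.0, 1.0)]:
    print(s1, s2, mp_variance_test_power(0.05, s1, s2))
```

In the first branch the exponent is below 1 and in the second it is above 1, so the power exceeds $\alpha$ in both directions, matching the general argument.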


Gamma and Her Sisters. 
(From the series ‘Movies that never made it to the Big Screen’.) 


Question 5.48 Let $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ be independent RVs each with the $N(0,1)$
distribution, and $x_1, x_2, \ldots, x_n$ be fixed real numbers. Let the RVs
$Y_1, Y_2, \ldots, Y_n$ be given by
\[ Y_i = \alpha + \beta x_i + \sigma\epsilon_i, \quad 1 \le i \le n, \]
where $\alpha, \beta \in \mathbb R$ and $\sigma^2 \in (0,\infty)$ are unknown parameters. Derive the form
of the least squares estimator (LSE) for the pair $(\alpha,\beta)$ and establish the
form of the distribution. Explain how to test the hypothesis $\beta = 0$ against
$\beta \ne 0$ and how to construct a 95 percent CI for $\beta$.


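Solution Define
\[ \bar x = \frac1n\sum_i x_i, \quad a = \alpha + \beta\bar x \quad\text{and}\quad u_i = x_i - \bar x, \]
with
\[ \sum_i u_i = 0 \quad\text{and}\quad Y_i = a + \beta u_i + \sigma\epsilon_i. \]
The LSE $(\hat a, \hat\beta)$ for $(a,\beta)$ minimises the quadratic sum $R = \sum_{i=1}^n (Y_i - EY_i)^2 =
\sum_{i=1}^n (Y_i - a - \beta u_i)^2$, i.e. solves
\[ \frac{\partial}{\partial a}R = \frac{\partial}{\partial\beta}R = 0. \]
This yields
\[ \hat a = \bar Y \sim N\left(a, \frac{\sigma^2}{n}\right) \]
and
\[ \hat\beta = \frac{\mathbf u^{\mathrm T}\mathbf Y}{\mathbf u^{\mathrm T}\mathbf u} \sim N\left(\beta, \frac{\sigma^2}{\mathbf u^{\mathrm T}\mathbf u}\right), \]
independently. Also,
\[ \hat R = \sum_i (Y_i - \hat a - \hat\beta u_i)^2 = \sum_i Y_i^2 - n\bar Y^2 - \mathbf u^{\mathrm T}\mathbf u\,\hat\beta^2 \sim \sigma^2\chi^2_{n-2}, \]
since we have estimated two parameters. So, $\hat R/(n-2)$ is an unbiased estimator
for $\sigma^2$.

Now consider $H_0$: $\beta = \beta_0 = 0$, $H_1$: $\beta \ne \beta_0$. Then
\[ \hat\beta \sim N\left(0, \frac{\sigma^2}{\mathbf u^{\mathrm T}\mathbf u}\right), \quad\text{i.e.}\quad \frac{\hat\beta\sqrt{\mathbf u^{\mathrm T}\mathbf u}}{\sigma} \sim N(0,1) \ \text{ iff } \beta = 0, \]
i.e. under $H_0$. Thus
\[ T = \frac{\hat\beta\sqrt{\mathbf u^{\mathrm T}\mathbf u}}{\sqrt{\hat R/(n-2)}} \sim t_{n-2}. \]
So, given $\alpha$, we reject $H_0$ when the value of $|T|$ exceeds $t_{n-2}(\alpha/2)$, the upper
$\alpha/2$ point of $t_{n-2}$.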

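Finally, to construct an (equal-tailed) 95 percent CI for $\beta$, we take
$t_{n-2}(0.025)$. Then from the inequalities
\[ -t_{n-2}(0.025) \le \frac{(\beta - \hat\beta)\sqrt{\mathbf u^{\mathrm T}\mathbf u}}{\sqrt{\hat R/(n-2)}} \le t_{n-2}(0.025) \]
we find that
\[ \left[\hat\beta - t_{n-2}(0.025)\sqrt{\frac{\hat R}{(n-2)\,\mathbf u^{\mathrm T}\mathbf u}},\ \ \hat\beta + t_{n-2}(0.025)\sqrt{\frac{\hat R}{(n-2)\,\mathbf u^{\mathrm T}\mathbf u}}\right] \]
is the required interval.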


Question 5.49 The table below gives the number of 0.25 km$^2$ plots in
the southern part of London each of which has been bombed $i$ times during
World War II. Check whether these data are consistent with the Poisson
distribution law at the significance level $\alpha = 0.05$.

  i    |   0    1    2    3    4   >=5 | Total
  n_i  | 229  211   93   35    7    1  |  576

Solution Here
\[ \hat\lambda = (211 + 93\times 2 + 35\times 3 + 7\times 4 + 5)/576 = 0.928819. \]
In this example the number of degrees of freedom $k = 5$, $p_0 = 0.39502$, $p_1 =
0.366902$, $p_2 = 0.170393$, $p_3 = 0.0527547$, $p_4 = 0.0122499$, $p_5 = 0.00268064$,
\[ np = (227.532,\ 211.336,\ 98.1464,\ 30.3867,\ 7.05594,\ 1.54405). \]
Compute the Pearson statistic
\[ X^2 = \sum_j \frac{(n_j - np_j)^2}{np_j} \]
\[ = (227.532 - 229)^2/227.532 + (211.336 - 211)^2/211.336 \]
\[ + (98.1464 - 93)^2/98.1464 + (30.3867 - 35)^2/30.3867 \]
\[ + (7.05594 - 7)^2/7.05594 + (1.54405 - 1)^2/1.54405 = 1.17239 \]
and compare with
\[ \chi^2_{0.9,5} = 9.236, \quad \chi^2_{0.95,5} = 11.07, \quad \chi^2_{0.975,5} = 12.83, \quad \chi^2_{0.99,5} = 15.09. \]
The data fit the model well.
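The arithmetic of this goodness-of-fit calculation is easy to reproduce. The sketch below follows the convention used in the solution, namely that the last cell is scored as exactly 5 hits when estimating the rate and that its probability is the Poisson tail $P(\ge 5)$; the variable names are assumptions made for illustration.

```python
import math

counts = [229, 211, 93, 35, 7, 1]      # plots hit 0, 1, 2, 3, 4 and >= 5 times
total = sum(counts)                     # 576 plots

# rate estimate; the ">= 5" cell is counted as 5 hits, as in the solution
lam = sum(i * c for i, c in enumerate(counts)) / total

probs = [math.exp(-lam) * lam**i / math.factorial(i) for i in range(5)]
probs.append(1.0 - sum(probs))          # tail probability P(>= 5)
expected = [total * p for p in probs]

pearson = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
print(round(lam, 6), [round(e, 3) for e in expected], round(pearson, 4))
```

The resulting statistic is far below all of the quoted chi-square quantiles, which is what the conclusion of the solution relies on.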


Question 5.50 Let $X_1, \ldots, X_n$ be a random sample from a normal distribution
with mean $\mu$ and variance $\sigma^2$, where $\mu$ and $\sigma^2$ are unknown. Derive
the form of the size-$\alpha$ generalised likelihood-ratio test of the hypothesis
$H_0$: $\mu = \mu_0$ against $H_1$: $\mu \ne \mu_0$, and show that it is equivalent to the
standard $t$-test of size $\alpha$.


[You should state, but need not derive, the distribution of the test statistic.] 


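Solution We have $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, independently, and $H_0$: $\mu = \mu_0$,
$H_1$: $\mu \ne \mu_0$, with an unknown $\sigma^2$. For given $\mu$ and $\sigma^2$, the likelihood equals
\[ L(\mu, \sigma^2; \mathbf x) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{1\le i\le n}(x_i - \mu)^2\right]. \]
The likelihood ratio for $H_0$ against $H_1$ is given by
\[ \Lambda = \frac{\sup\,(L\,|H_1)}{\sup\,(L\,|H_0)}. \]
Under $H_0$, the MLE is $\hat\sigma^2_{H_0} = \frac1n\sum_i (x_i - \mu_0)^2$, while under $H_1$, the MLEs are
$\hat\mu_{H_1} = \bar x$ and $\hat\sigma^2_{H_1} = \frac1n\sum_i (x_i - \bar x)^2$. This yields
\[ \Lambda = \left(\frac{\hat\sigma^2_{H_0}}{\hat\sigma^2_{H_1}}\right)^{n/2} = \left[\frac{\sum_i (x_i - \mu_0)^2}{\sum_i (x_i - \bar x)^2}\right]^{n/2}. \]
The condition $\Lambda > c$ is equivalent to
\[ \frac{\sum_i (x_i - \mu_0)^2}{\sum_i (x_i - \bar x)^2} = \frac{\sum_i (x_i - \bar x)^2 + n(\bar x - \mu_0)^2}{\sum_i (x_i - \bar x)^2} > c^{2/n}, \]
or to $|T(\mathbf x)| > \left[(n-1)(c^{2/n} - 1)\right]^{1/2}$ where
\[ T(\mathbf x) = \frac{\sqrt n(\bar x - \mu_0)}{\sqrt{\frac{1}{n-1}\sum_i (x_i - \bar x)^2}}. \]
Under $H_0$, $T(\mathbf X) \sim t_{n-1}$, so this is the standard $t$-test.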


Question 5.51 Let $X_1, \ldots, X_n$ be IID RVs with
\[ P(X_1 = 1) = \theta = 1 - P(X_1 = 0), \]
where $\theta$ is an unknown parameter, $0 < \theta < 1$, and $n \ge 2$. It is desired to
estimate the quantity $\phi = \theta(1 - \theta) = n\,\mathrm{Var}\left((X_1 + \cdots + X_n)/n\right)$.

(i) Find the maximum-likelihood estimator, $\hat\phi$, of $\phi$.

(ii) Show that $\hat\phi_1 = X_1(1 - X_2)$ is an unbiased estimator of $\phi$ and hence,
or otherwise, obtain an unbiased estimator of $\phi$ which has smaller variance
than $\hat\phi_1$ and which is a function of $\hat\phi$.

(iii) Now suppose that a Bayesian approach is adopted and that the prior
distribution for $\theta$, $\pi(\theta)$, is taken to be the uniform distribution on $(0,1)$.
Compute the Bayes point estimate of $\phi$ when the loss function is $L(\phi, a) =
(\phi - a)^2$.

You may use the fact that when $r$, $s$ are non-negative integers,
\[ \int_0^1 x^r(1 - x)^s\,dx = r!\,s!/(r + s + 1)! \]
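Solution (i) The MLE for $\theta$ is
\[ \hat\theta = \frac1n\sum_{i=1}^n X_i, \]
and for $\phi$
\[ \hat\phi = \hat\theta(1 - \hat\theta). \]
(ii) Because of independence, $E\hat\phi_1 = EX_1\,E(1 - X_2)$. With $EX_i = \theta$, we
obtain that
\[ E\hat\phi_1 = \theta(1 - \theta) = \phi. \]
Next, the statistic
\[ T = \sum_{i=1}^n X_i \]
is sufficient for $\theta$ and $\phi$. By Rao–Blackwell, the conditional expectation
$E(\hat\phi_1|T)$ is an unbiased estimator with a smaller variance. An explicit
calculation yields
\[ E(\hat\phi_1|T = t) = P(X_1 = 1, X_2 = 0|T = t)
   = \frac{\theta(1-\theta)\binom{n-2}{t-1}\theta^{t-1}(1-\theta)^{n-t-1}}{\binom nt\theta^t(1-\theta)^{n-t}} = \frac{t(n-t)}{n(n-1)}. \]
Therefore,
\[ E(\hat\phi_1|T) = \frac{T(n - T)}{n(n-1)} = \frac{n^2\hat\theta(1 - \hat\theta)}{n(n-1)} = \frac{n}{n-1}\,\hat\phi. \]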

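(iii) With
\[ f(\mathbf x|\theta) = \theta^T(1 - \theta)^{n-T} \quad\text{and}\quad \pi(\theta) = I(0 < \theta < 1), \]
we obtain
\[ \pi(\theta|\mathbf x) \propto \theta^T(1 - \theta)^{n-T}, \quad 0 < \theta < 1. \]
Then the Bayes estimator for $\phi$ is given by
\[ E(\phi|\mathbf X) = E(\theta(1 - \theta)|\mathbf X) = \frac{\int_0^1 \theta^{T+1}(1 - \theta)^{n-T+1}\,d\theta}{\int_0^1 \theta^{T}(1 - \theta)^{n-T}\,d\theta} \]
\[ = \frac{(T+1)!\,(n-T+1)!\,(n+1)!}{(n+3)!\,T!\,(n-T)!} = \frac{(T+1)(n-T+1)}{(n+2)(n+3)}. \]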

Question 5.52 Suppose that $X$ is a random variable drawn from the
probability density function
\[ f(x|\theta) = \frac{1}{2\Gamma(\theta)}\,|x|^{\theta-1}e^{-|x|}, \quad -\infty < x < \infty, \]
where $\Gamma(\theta) = \int_0^\infty y^{\theta-1}e^{-y}\,dy$ and $\theta \ge 1$ is unknown. Find the most powerful
test of size $\alpha$, $0 < \alpha < 1$, of the hypothesis $H_0$: $\theta = 1$ against the alternative
$H_1$: $\theta = 2$. Express the power of the test as a function of $\alpha$.

Is your test uniformly most powerful for testing $H_0$: $\theta = 1$ against $H_1$:
$\theta > 1$?


Solution We test $H_0$: $\theta = 1$ against $H_1$: $\theta = 2$, with the ratio statistic
$f_1(x)/f_0(x) = |x|$. So, the NP test is: reject $H_0$ when $|x| > C$ where $C$ is
related to $\alpha$ by
\[ \alpha = P_{H_0}(|X| > C) = \int_C^{+\infty} e^{-x}\,dx = e^{-C} \quad\text{(by symmetry)}, \]
whence $C = -\ln\alpha$.
The power function is given by
\[ p(\alpha) = P_{H_1}(|X| > -\ln\alpha) = \int_{-\ln\alpha}^{+\infty} xe^{-x}\,dx
   = -xe^{-x}\Big|_{-\ln\alpha}^{+\infty} + \int_{-\ln\alpha}^{+\infty} e^{-x}\,dx = \alpha(1 - \ln\alpha). \]
To test $H_0$: $\theta = 1$ against $H_1$: $\theta = \theta_1 > 1$, the ratio statistic becomes
$|x|^{\theta_1-1}/\Gamma(\theta_1)$ and is monotone in $|x|$. Hence, again, the NP test rejects $H_0$
when $|x| > C$. This is true for all $\theta_1 > 1$, therefore the test $|x| > C$ is the
uniformly most powerful.
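The closed form $p(\alpha) = \alpha(1 - \ln\alpha)$ can be checked against a direct numerical integration of $xe^{-x}$ over the rejection region. The sketch below uses a plain midpoint rule; the truncation point and the number of steps are arbitrary assumptions, and the function names are illustrative.

```python
import math

def power(alpha):
    """Closed-form power alpha * (1 - ln alpha) of the test |x| > -ln alpha."""
    return alpha * (1.0 - math.log(alpha))

def power_numeric(alpha, steps=200000, upper=60.0):
    """Midpoint-rule approximation of the integral of x * exp(-x)
    from -ln(alpha) to a large truncation point."""
    c = -math.log(alpha)
    h = (upper - c) / steps
    total = 0.0
    for i in range(steps):
        x = c + (i + 0.5) * h
        total += x * math.exp(-x)
    return total * h

print(power(0.05), power_numeric(0.05))
```

Since $1 - \ln\alpha > 1$ for $0 < \alpha < 1$, the power always exceeds the size, as it should for an MP test.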


Question 5.53 Light bulbs are sold in packets of 3 but some of the bulbs
are defective. A sample of 256 packets yields the following figures for the
number of defectives in a packet.

  No. of defectives |   0    1    2    3
  No. of packets    | 116   94   40    6

Test the hypothesis that each bulb has a constant (but unknown) probability
$\theta$ of being defective independently of all other bulbs.

Hint: You may wish to use some of the following percentage points.

  Distribution    | chi^2_1  chi^2_2  chi^2_3  chi^2_4   t_1    t_2    t_3    t_4
  90th percentile |  2.71     4.61     6.25     7.78    3.08   1.89   1.64   1.53
  95th percentile |  3.84     5.99     7.81     9.49    6.31   2.92   2.35   2.13

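Solution The null hypothesis is $H_0$: $p_i = \binom 3i\theta^i(1 - \theta)^{3-i}$, for $i = 0, 1, 2, 3$,
tested against the alternative $H_1$: $p_i \in [0,1]$, $p_0 + p_1 + p_2 + p_3 = 1$.
The MLE for $\theta$ under $H_0$ equals
\[ \hat\theta = \frac{94\times 1 + 40\times 2 + 6\times 3}{3\times 256} = \frac{192}{768} = \frac14, \]
whence the MLEs for $p_i$ are given by
\[ \hat p_0 = \frac{27}{64}, \quad \hat p_1 = \frac{27}{64}, \quad \hat p_2 = \frac{9}{64}, \quad \hat p_3 = \frac{1}{64}. \]
Thus, the observed/expected table is

  O | 116   94   40    6
  E | 108  108   36    4

and Pearson's chi-square statistic takes the value
\[ \frac{64}{108} + \frac{196}{108} + \frac{16}{36} + \frac44 = \frac{416}{108} \approx 3.85. \]
Comparing to the tail of $\chi^2_2$ (four cells, less one for the constraint and one for the
estimated parameter), we accept $H_0$ at the 90 percent level.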
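The numbers in this solution can be reproduced in a few lines. The sketch below estimates $\theta$ from the table, forms the binomial expected counts and evaluates Pearson's statistic; the variable names are assumptions made for illustration.

```python
from math import comb

counts = [116, 94, 40, 6]               # packets with 0, 1, 2, 3 defectives
packets = sum(counts)                    # 256 packets, 3 bulbs each

# MLE of the per-bulb defect probability: defective bulbs / all bulbs
theta = sum(i * c for i, c in enumerate(counts)) / (3 * packets)

probs = [comb(3, i) * theta**i * (1 - theta) ** (3 - i) for i in range(4)]
expected = [packets * p for p in probs]
pearson = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
print(theta, expected, round(pearson, 3))
```

The statistic is then compared with the chi-square percentage points given in the hint.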


Question 5.54 Consider the linear regression model
\[ Y_i = \alpha + \beta x_i + \epsilon_i, \]
where the numbers $x_1, \ldots, x_n$ are known, the independent RVs $\epsilon_1, \ldots, \epsilon_n$
have the $N(0,\sigma^2)$ distribution, and the parameters $\alpha$, $\beta$ and $\sigma^2$ are unknown.
(a) Find the least squares estimates (LSE) $\hat\alpha$, $\hat\beta$ of parameters $\alpha$, $\beta$. Do
they coincide with the maximum likelihood estimates (MLE)?

(b) Find the MLE $\hat\sigma^2$ of parameter $\sigma^2$. How do you obtain an unbiased
estimate of the parameter $\sigma^2$?

(c) Find the joint distribution of estimates $\hat\alpha$, $\hat\beta$, $\hat\sigma^2$.

(d) How to check the hypothesis $H_0$: $\alpha = \alpha_0$ against an alternative
$H_1$: $\alpha \ne \alpha_0$?


Solution (a) Minimising the sum $S = \sum_i (Y_i - \alpha - \beta x_i)^2$ we obtain
\[ 2\sum_i (Y_i - \alpha - \beta x_i) = 2\sum_i x_i(Y_i - \alpha - \beta x_i) = 0. \]


Hence, 


~ ~~ F CY; 
a=Y, — . 
aU; 


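On the other hand, the likelihood takes the form
\[ L = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\left[-\frac{1}{2\sigma^2}\sum_i (Y_i - \alpha - \beta x_i)^2\right]. \]
So, the maximisation of $L$ for fixed $\sigma^2$ is equivalent to minimisation of $S$.
(b) By a standard calculation we obtain the MLE of $\sigma^2$:
\[ \hat\sigma^2 = \frac1n\sum_i (Y_i - \hat\alpha - \hat\beta x_i)^2 \]
where $\hat\alpha$, $\hat\beta$ are the MLEs of the parameters $\alpha$, $\beta$. Next,
\[ \sum_i (Y_i - \hat\alpha - \hat\beta x_i)^2/\sigma^2 \sim \chi^2_{n-2}. \]
So, we get
\[ E\left(\frac{n\hat\sigma^2}{n-2}\right) = \sigma^2, \]
i.e. $n\hat\sigma^2/(n-2)$ is an unbiased estimate of $\sigma^2$.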

(c) Select the row of n x n matrix to form an orthogonal matrix 


vn... = Alyn 
O=] 4/04 7... nf i i 


Applying this orthogonal transformation we obtain that @, 8 and G? are 
independent. In fact, @ ~ N(a,o?/n), 8 ~ N(B,02/n) and ng? /o? ~ x2_5. 
(d) Compare the statistic
\[ T = \frac{\sqrt n(\hat\alpha - \alpha_0)}{\sqrt{n\hat\sigma^2/(n-2)}} \]
with the quantile of the Student distribution. The hypothesis is rejected if
$|T|$ exceeds the upper quantile of $t_{n-2}$ at half the size of the test.


Question 5.55 Suppose $X_1, \ldots, X_n$ are independent $N(0,\sigma^2)$ RVs, where
$\sigma^2$ is an unknown parameter. Explain carefully how to construct the uniformly
most powerful test of size $\alpha$ for the hypothesis $H_0$: $\sigma^2 = 1$ versus the
alternative $H_1$: $\sigma^2 > 1$.


Solution The Neyman–Pearson Lemma says the most powerful test of size $\alpha$
for testing the simple hypothesis $H_0$: $\sigma = 1$ versus the alternative $H_1$: $\sigma = \sigma_1$
is a likelihood ratio test, i.e. the critical region is of the form
\[ \{L(\sigma_1; \mathbf X)/L(1; \mathbf X) > c\}. \]
In this case,
\[ L(\sigma_1; \mathbf X)/L(1; \mathbf X) \propto \exp\left(\frac12\left(1 - \sigma_1^{-2}\right)\sum_i X_i^2\right) \]
and when $\sigma_1 > 1$ the right-hand side is increasing in the statistic $\sum_i X_i^2$.
Hence, the form of the rejection region is
\[ \Big\{\sum_i X_i^2 \ge C\Big\}. \]
This test has size $\alpha$ if
\[ \alpha = P(\text{type I error}) = P_0\Big(\sum_i X_i^2 \ge C\Big) = 1 - F(C), \]
where $F$ is the distribution function of the $\chi^2(n)$ distribution. Hence
\[ C = F^{-1}(1 - \alpha). \]
Since the form of the test does not depend on the alternative $H_1$: $\sigma = \sigma_1$,
as long as $\sigma_1 > 1$, the test is uniformly most powerful for testing $H_1$: $\sigma > 1$.


Question 5.56 Consider the linear regression model
\[ Y_i = \beta x_i + \epsilon_i, \]
where the numbers $x_1, \ldots, x_n$ are known, the independent RVs $\epsilon_1, \ldots, \epsilon_n$
have the $N(0,\sigma^2)$ distribution, and the parameters $\beta$ and $\sigma^2$ are unknown.
Find the maximum likelihood estimator for $\beta$.
State and prove the Gauss–Markov Theorem in the context of this model.
Write down the distribution of an arbitrary linear estimator for $\beta$. Hence
show that there exists a linear, unbiased estimator $\hat\beta$ for $\beta$ such that
\[ E_{\beta,\sigma^2}[(\hat\beta - \beta)^4] \le E_{\beta,\sigma^2}[(\tilde\beta - \beta)^4] \]
for all linear, unbiased estimators $\tilde\beta$.

Hint: If $Z \sim N(a, b^2)$ then $E[(Z - a)^4] = 3b^4$.


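Solution The density of $Y_i$ is $f(y;\beta,\sigma) \propto \sigma^{-1}e^{-(y-\beta x_i)^2/(2\sigma^2)}$ so the
log-likelihood is
\[ \ell(\beta,\sigma;Y_1,\ldots,Y_n) = -n\log\sqrt{2\pi} - n\log\sigma - \frac{1}{2\sigma^2}\sum_i (Y_i - \beta x_i)^2 \]
which is maximised at
\[ \hat\beta = \frac{\sum_i x_iY_i}{\sum_i x_i^2}. \]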

Theorem [Gauss–Markov] For all linear unbiased estimators
\[ \tilde\beta = \sum_i c_iY_i \]
of $\beta$, we have
\[ E_{\beta,\sigma}[(\hat\beta - \beta)^2] \le E_{\beta,\sigma}[(\tilde\beta - \beta)^2]. \]


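Proof Suppose $\hat\theta$ is an unbiased linear estimator of 0, so that
\[ \hat\theta = \sum_i a_iY_i \quad\text{and}\quad \sum_i a_ix_i = 0. \]
Then $\mathrm{Cov}(\hat\theta, \hat\beta) = 0$ since
\[ \mathrm{Cov}\Big(\sum_i a_iY_i, \sum_i x_iY_i\Big) = \sum_{i,j}\mathrm{Cov}(a_iY_i, x_jY_j) = \sigma^2\sum_i a_ix_i = 0. \]
Now let $\tilde\beta$ be an unbiased linear estimator of $\beta$, so that $\tilde\beta - \hat\beta$ is an unbiased
linear estimator of 0. Then
\[ \mathrm{Var}(\tilde\beta) = \mathrm{Var}(\tilde\beta - \hat\beta) + 2\,\mathrm{Cov}(\tilde\beta - \hat\beta, \hat\beta) + \mathrm{Var}(\hat\beta) \]
\[ = \mathrm{Var}(\tilde\beta - \hat\beta) + \mathrm{Var}(\hat\beta) \]
\[ \ge \mathrm{Var}(\hat\beta). \]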

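If $\tilde\beta = \sum_i c_iY_i$ is a linear combination of independent normals, then $\tilde\beta$ is
normally distributed with mean $\beta\sum_i c_ix_i$ and variance $\sigma^2\sum_i c_i^2$.
By the hint, for an unbiased $\tilde\beta$,
\[ E_{\beta,\sigma^2}[(\tilde\beta - \beta)^4] = 3\sigma^4\Big(\sum_i c_i^2\Big)^2. \]
But the Gauss–Markov Theorem says $\sum_i c_i^2$ is minimised over $\sum_i c_ix_i = 1$
when $c_i = x_i/\sum_j x_j^2$. Hence, the minimising estimator is $\hat\beta$, the maximum
likelihood estimator.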


I mean the word proof not in the sense of the lawyers, who set two half 
proofs equal to a whole one, but in the sense of a mathematician, where a 
half proof is nothing, and it is demanded for proof that every doubt 
becomes impossible. 

K.-F. Gauss (1777–1855), German mathematician and physicist.


Question 5.57 A washing powder manufacturer wants to determine
the effectiveness of a television advertisement. Before the advertisement is 
shown, a pollster asks 100 randomly chosen people which of the three most 
popular washing powders, labelled A, B and C, they prefer. After the ad- 
vertisement is shown, another 100 randomly chosen people (not the same as 
before) are asked the same question. The results are summarised below. 


         |  A    B    C
  before | 36   47   17
  after  | 44   33   23

Derive and carry out an appropriate test at the 5 percent significance 
level of the hypothesis that the advertisement has had no effect on people’s 
preferences. 


You may find the following table helpful:

                | chi^2_1  chi^2_2  chi^2_3  chi^2_4  chi^2_5  chi^2_6
  95 percentile |  3.84     5.99     7.82     9.49    11.07    12.59
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Solution Model both surveys as draws from trinomial distributions with 
parameters p = (pi,p2,p3) and q = (q,4q2,93) respectively. The log- 
likelihood of the data is then 


N,!.No! 


£(p, q) = log — + 52 Niarlogpi + 5° Nizlog a. 


The null hypothesis is Hg: p = q and Hy: p,q unrestricted. 
Under Ho, the likelihood is maximised over p = q, p1 + p2 + p3 = 1 when 


oe Ni + No 
and under Hy, 
we ON a. Nap 
Pe> and GN, 
The log likelihood ratio is then 
2b =2 u Nj, log se + Nj log <2 — (Ni + Ni.) log sot 


where Ej; = N;(Ni + Ni,2)/(Ni + No) are the expected number of obser- 
vations of the ith category in jth experiment. 

Since Ho has two degrees of freedom, and H, has four, the statistic 2D has 
approximately the x? distribution with 4 — 2 = 2 degrees of freedom. 

(40 — 36)? (40 — 47)? i (20 — 17)? 
40 40 20 

_ (40-44)? (40-33)? | (20 — 23)? 

"40 7 20 
Since 5.15 < 5.99, the hypothesis is not rejected at the 0.95 significance 
level. 
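
The arithmetic of this test is easy to reproduce. The following sketch (function names are illustrative, not from the text) evaluates both Pearson's statistic and the exact log likelihood ratio statistic from the table of counts; they come to about 4.15 and 4.17 respectively, both below the critical value 5.99 quoted in the question.

```python
import math

before = [36, 47, 17]   # counts for A, B, C before the advertisement
after = [44, 33, 23]    # counts for A, B, C after the advertisement

def expected_counts(rows):
    """E_ij = (row total) * (column total) / (grand total) under H0: p = q."""
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(col) for col in zip(*rows)]
    grand = sum(row_tot)
    return [[rt * ct / grand for ct in col_tot] for rt in row_tot]

def pearson(rows):
    """Pearson statistic: sum of (O - E)^2 / E over all cells."""
    exp = expected_counts(rows)
    return sum((o - e) ** 2 / e
               for r, er in zip(rows, exp) for o, e in zip(r, er))

def log_lr(rows):
    """Exact statistic 2 * sum of O * log(O / E) over all cells."""
    exp = expected_counts(rows)
    return 2 * sum(o * math.log(o / e)
                   for r, er in zip(rows, exp) for o, e in zip(r, er))

rows = [before, after]
x2 = pearson(rows)
g2 = log_lr(rows)
critical = 5.99  # 95th percentile of chi-square with 2 degrees of freedom
print(round(x2, 2), round(g2, 2), x2 < critical)
```

The two statistics agree closely here because every expected count is large, which is the regime in which the chi-square approximation is usually trusted.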


Question 5.58 Let $X_1, \ldots, X_n$ be independent Exp($\theta$) RVs with unknown
parameter $\theta$. Find the maximum likelihood estimator $\hat\theta$ of $\theta$, and state the
distribution of $n/\hat\theta$. Show that $\theta/\hat\theta$ has the $\Gamma(n, n)$ distribution. Find the
$100(1-\alpha)$ percent CI for $\theta$ of the form $[0, C\hat\theta]$ for a constant $C > 0$ depending
on $\alpha$.

Now, taking a Bayesian point of view, suppose your prior distribution for
the parameter $\theta$ is $\Gamma(k, \lambda)$. Show that your Bayesian point estimator $\hat\theta_B$ of
$\theta$ for the loss function $L(\theta, a) = (\theta - a)^2$ is given by
$$\hat\theta_B = \frac{n + k}{\lambda + \sum_i X_i}.$$
Find a constant $C_B > 0$ depending on $\alpha$ such that the posterior probability
that $\theta \le C_B \hat\theta_B$ is equal to $1 - \alpha$.
Hint: The density of the $\Gamma(k, \lambda)$ distribution is
$f(x; k, \lambda) = \lambda^k x^{k-1} e^{-\lambda x}/\Gamma(k)$, for $x > 0$.

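Solution Since the density of each $X_i$ is $f(x; \theta) = \theta e^{-\theta x}$, the log-likelihood
is given by
$$\ell(\theta; X_1, \ldots, X_n) = n \log \theta - \theta (X_1 + \cdots + X_n),$$
which is maximised at
$$\hat\theta = \frac{n}{\sum_i X_i}.$$
The distribution of $n/\hat\theta = \sum_i X_i$ is $\Gamma(n, \theta)$. Hence $\theta/\hat\theta \sim (\theta/n)\Gamma(n, \theta) = \Gamma(n, n)$.

We must find $C$ such that $P_\theta(C\hat\theta < \theta) \le \alpha$ for all $\theta$. Now
$$P_\theta(C\hat\theta < \theta) = P_\theta(\theta/\hat\theta > C) = \bar F(C; n, n),$$
where $\bar F(\cdot\,; k, \lambda)$ is the upper tail of the $\Gamma(k, \lambda)$ distribution function. In
particular, we may take
$$C = \bar F^{-1}(\alpha; n, n) = n^{-1} \bar F^{-1}(\alpha; n, 1).$$

The prior density of $\Theta$ is
$$\pi(\theta) \propto \theta^{k-1} e^{-\lambda\theta}.$$
Hence the posterior density is
$$\pi(\theta; X_1, \ldots, X_n) \propto \pi(\theta) f(X_1, \ldots, X_n \mid \theta) \propto \theta^{k+n-1} e^{-\theta(\lambda + \sum_i X_i)}.$$
The posterior distribution of $\Theta$ is $\Gamma(k + n, \lambda + \sum_i X_i)$.

The Bayesian point estimator for the quadratic loss function is the
posterior mean. But the mean of the $\Gamma(k + n, \lambda + \sum_i X_i)$ distribution is
$(k + n)/(\lambda + \sum_i X_i)$, as claimed.

Note that, under the posterior, $\Theta/\hat\theta_B$ has the $\Gamma(k + n, k + n)$ distribution. Hence
$$C_B = \bar F^{-1}(\alpha; k + n, k + n) = (k + n)^{-1} \bar F^{-1}(\alpha; k + n, 1).$$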
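
For readers who wish to evaluate $C$ and $C_B$ numerically, here is a minimal sketch using only the standard library. It assumes the shape parameter is a positive integer (so $k$ must be an integer), in which case the upper tail of $\Gamma(n, 1)$ has the closed form $e^{-x} \sum_{j<n} x^j/j!$, and it inverts the tail by bisection. The sample `xs`, the prior parameters and all function names are hypothetical.

```python
import math

def erlang_upper_tail(x, n):
    """P(W > x) for W ~ Gamma(n, 1) with integer shape n >= 1."""
    if x <= 0:
        return 1.0
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= x / j          # term is now x^j / j!
        total += term
    return math.exp(-x) * total

def upper_quantile(alpha, n, tol=1e-12):
    """Solve erlang_upper_tail(x, n) = alpha for x by bisection."""
    lo, hi = 0.0, 1.0
    while erlang_upper_tail(hi, n) > alpha:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if erlang_upper_tail(mid, n) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def frequentist_constant(alpha, n):
    """C = (1/n) * upper alpha-quantile of Gamma(n, 1)."""
    return upper_quantile(alpha, n) / n

def bayes_constant(alpha, n, k):
    """C_B = (1/(n+k)) * upper alpha-quantile of Gamma(n+k, 1); k an integer here."""
    return upper_quantile(alpha, n + k) / (n + k)

xs = [0.8, 1.3, 0.4, 2.1, 0.9]   # hypothetical sample
n, alpha = len(xs), 0.05
k, lam = 2, 1.0                  # hypothetical prior Gamma(k, lam)
theta_hat = n / sum(xs)
theta_B = (n + k) / (lam + sum(xs))
print("upper confidence limit:", frequentist_constant(alpha, n) * theta_hat)
print("upper posterior limit: ", bayes_constant(alpha, n, k) * theta_B)
```

For $n = 1$ the tail is $e^{-x}$, so the constant reduces to $C = -\log\alpha$, which gives a quick sanity check on the bisection.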
Question 5.59 Consider a collection $X_1, \ldots, X_n$ of independent RVs with
common density function $f(x; \theta)$ depending on a real parameter $\theta$. What
does it mean to say $T$ is a sufficient statistic for $\theta$? Prove that if the joint
density of $X_1, \ldots, X_n$ satisfies the factorisation criterion for a statistic $T$,
then $T$ is sufficient for $\theta$.


Let each variable $X_i$ be uniformly distributed on $[-\sqrt\theta, \sqrt\theta]$. Find a
two-dimensional sufficient statistic $T = (T_1, T_2)$. Using the fact that $\hat\theta = 3X_1^2$
is an unbiased estimator of $\theta$, or otherwise, find an unbiased estimator of $\theta$
which is a function of $T$ and has smaller variance than $\hat\theta$.


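Solution A statistic $T$ is sufficient if the conditional distribution of
$X_1, \ldots, X_n$ given $T$ does not depend on $\theta$.


The factorisation criterion reads
$$f(x_1, \ldots, x_n; \theta) = g(T(x); \theta)\, h(x)$$
for functions $g$ and $h$. Given this, the joint density of $X_1, \ldots, X_n$ given $T = t$
is
$$\frac{g(t; \theta) h(x)}{\int_{y: T(y) = t} g(t; \theta) h(y)\, dy} = \frac{h(x)}{\int_{y: T(y) = t} h(y)\, dy},$$
which does not depend on $\theta$.

Since $f(x; \theta) = \frac{1}{2\sqrt\theta} \mathbf{1}_{[-\sqrt\theta, \sqrt\theta]}(x)$, the joint PDF is
$$f(x_1, \ldots, x_n; \theta) = (2\sqrt\theta)^{-n}\, \mathbf{1}\{\min_i x_i \ge -\sqrt\theta\}\, \mathbf{1}\{\max_i x_i \le \sqrt\theta\},$$
so by the factorisation criterion, $T = (\min_i X_i, \max_i X_i)$ is sufficient for $\theta$.

By the Rao-Blackwell Theorem, if $\hat\theta$ is an unbiased estimator of $\theta$ and if
$T$ is sufficient, then $E(\hat\theta \mid T = t)$ is an unbiased estimator of $\theta$ with smaller
variance. Given $\min_i X_i = m$ and $\max_i X_i = M$, the variable $X_1$ equals $m$ with
probability $1/n$, equals $M$ with probability $1/n$, and is otherwise uniform on
$(m, M)$. In this case,
$$\begin{aligned}
E[3X_1^2 \mid \min_i X_i = m, \max_i X_i = M]
&= \frac{1}{n} 3m^2 + \frac{1}{n} 3M^2 + \frac{n-2}{n} \frac{1}{M - m} \int_m^M 3y^2\, dy \\
&= \frac{1}{n} \big[ 3m^2 + 3M^2 + (n - 2)(M^2 + Mm + m^2) \big] \\
&= \frac{1}{n} \big[ (n + 1) m^2 + (n - 2) mM + (n + 1) M^2 \big].
\end{aligned}$$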
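
A small simulation gives a rough check that the conditional expectation above is unbiased for $\theta$ and has smaller variance than $3X_1^2$. This is only a sketch: the values $\theta = 4$ and $n = 5$, the number of replications, the seed and the function names are all arbitrary choices made for illustration.

```python
import math
import random

def crude(xs):
    """The original unbiased estimator 3 * X_1^2."""
    return 3.0 * xs[0] ** 2

def rao_blackwell(xs):
    """E[3 X_1^2 | min, max] = [(n+1) m^2 + (n-2) m M + (n+1) M^2] / n."""
    n = len(xs)
    m, M = min(xs), max(xs)
    return ((n + 1) * m * m + (n - 2) * m * M + (n + 1) * M * M) / n

def simulate(theta, n, reps, seed=0):
    """Draw `reps` samples of size n from U[-sqrt(theta), sqrt(theta)]."""
    rng = random.Random(seed)
    a = math.sqrt(theta)
    crude_vals, rb_vals = [], []
    for _ in range(reps):
        xs = [rng.uniform(-a, a) for _ in range(n)]
        crude_vals.append(crude(xs))
        rb_vals.append(rao_blackwell(xs))
    return crude_vals, rb_vals

def mean(v):
    return sum(v) / len(v)

def variance(v):
    mu = mean(v)
    return sum((t - mu) ** 2 for t in v) / (len(v) - 1)

crude_vals, rb_vals = simulate(theta=4.0, n=5, reps=20000)
print("means:    ", mean(crude_vals), mean(rb_vals))
print("variances:", variance(crude_vals), variance(rb_vals))
```

Both sample means should lie close to the chosen $\theta$, while the sample variance of the Rao-Blackwellised estimator should be visibly smaller, in line with the theorem.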




Each of us has been doing statistics all his life, 

in the sense that each of us has been busily reaching 
conclusions based on empirical observations 

ever since birth. 

W. Kruskal (1919-2005), American statistician. 


We finish this volume with a story about F. Yates (1902-1994), a prominent
UK statistician and a close associate of Fisher (quoted from [151],
pp. 204-205). During his student years at St John’s College, Cambridge,
Yates had been keen on a form of sport which had a long local tradition.
It consisted of climbing about the roofs and towers of the college buildings
at night. (The satisfaction arose partly from the difficulty of the climbs and
partly from the excitement of escaping the vigilance of the college porters.)
In particular, the chapel of St John’s College has a massive neo-Gothic tower
adorned with statues of saints, and to Yates it appeared obvious that it would
be more decorous if these saints were properly attired in surplices. One night
he climbed up and did the job; next morning the result was generally much
admired. But the College authorities were unappreciative and began to consider
means of divesting the saints of their newly acquired garments. This
was not easy, since they were well out of reach of any ordinary ladder. An
attempt to lift the surplices off from above, using ropes with hooks attached,
was unsuccessful, since Yates, anticipating such a move, had secured the
surplices with pieces of wire looped around the saints’ necks. No progress was
being made and eventually Yates came forward and, without admitting that
he had been responsible for putting the surplices up, volunteered to climb
up in the daylight and bring them down. This he did to the admiration of
the crowd that assembled.

The moral of this story is that maybe statisticians should pay more
attention to self-generated problems .... (An observant passer-by may notice
that presently two of the statues on the St John’s chapel tower have staffs
painted in a pale green colour, which obviously is not an originally intended
decoration. Perhaps the next generation of followers of Fisher’s school are
practicing before shaping the development of twenty-first century statistics.)


Appendix 


Tables of random variables and 
probability distributions 


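Table A.1 Some useful discrete distributions

| Family, notation, range | PMF $P(X = r)$ | Mean | Variance | PGF $E s^X$ |
|---|---|---|---|---|
| Poisson, $\mathrm{Po}(\lambda)$, $r = 0, 1, \ldots$ | $\lambda^r e^{-\lambda}/r!$ | $\lambda$ | $\lambda$ | $e^{\lambda(s-1)}$ |
| Geometric, $\mathrm{Geom}(p)$, $r = 0, 1, \ldots$ | $p(1-p)^r$ | $\frac{1-p}{p}$ | $\frac{1-p}{p^2}$ | $\frac{p}{1 - s(1-p)}$ |
| Binomial, $\mathrm{Bin}(n, p)$, $r = 0, \ldots, n$ | $\binom{n}{r} p^r (1-p)^{n-r}$ | $np$ | $np(1-p)$ | $[ps + (1-p)]^n$ |
| Negative Binomial, $\mathrm{NegBin}(p, k)$, $r = 0, 1, \ldots$ | $\binom{r+k-1}{r} p^k (1-p)^r$ | $\frac{k(1-p)}{p}$ | $\frac{k(1-p)}{p^2}$ | $\big[\frac{p}{1 - s(1-p)}\big]^k$ |
| Hypergeometric, $\mathrm{Hyp}(N, D, n)$, $r = (n + D - N)_+, \ldots, D \wedge n$ | $\binom{D}{r}\binom{N-D}{n-r} \big/ \binom{N}{n}$ | $\frac{nD}{N}$ | $\frac{nD(N-D)(N-n)}{N^2(N-1)}$ | ${}_2F_1(-D, -n; -N; 1-s)$ |
| Uniform, $U[1, n]$, $r = 1, \ldots, n$ | $\frac{1}{n}$ | $\frac{n+1}{2}$ | $\frac{n^2-1}{12}$ | $\frac{s(1-s^n)}{n(1-s)}$ |

Table A.2 Some useful continuous distributions

| Family, notation, range | PDF $f_X(x)$ | Mean | Variance | MGF $E e^{tX}$ | CHF $E e^{itX}$ |
|---|---|---|---|---|---|
| Uniform, $U(a, b)$, $(a, b)$ | $\frac{1}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $\frac{e^{bt} - e^{at}}{(b-a)t}$ | $\frac{e^{ibt} - e^{iat}}{i(b-a)t}$ |
| Exponential, $\mathrm{Exp}(\lambda)$, $\mathbb{R}_+$ | $\lambda e^{-\lambda x}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ | $\frac{\lambda}{\lambda - t}$ | $\frac{\lambda}{\lambda - it}$ |
| Gamma, $\mathrm{Gam}(\alpha, \lambda)$, $\mathbb{R}_+$ | $\frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}$ | $\frac{\alpha}{\lambda}$ | $\frac{\alpha}{\lambda^2}$ | $(1 - t/\lambda)^{-\alpha}$ | $(1 - it/\lambda)^{-\alpha}$ |
| Normal, $N(\mu, \sigma^2)$, $\mathbb{R}$ | $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\big[-\frac{1}{2\sigma^2}(x - \mu)^2\big]$ | $\mu$ | $\sigma^2$ | $\exp(t\mu + \frac{1}{2} t^2 \sigma^2)$ | $\exp(it\mu - \frac{1}{2} t^2 \sigma^2)$ |
| Multivariate normal, $N(\mu, \Sigma)$, $\mathbb{R}^n$ | $\frac{\exp[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)]}{\sqrt{(2\pi)^n \det \Sigma}}$ | $\mu$ | $\Sigma$ | $\exp(t^T\mu + \frac{1}{2} t^T \Sigma t)$ | $\exp(it^T\mu - \frac{1}{2} t^T \Sigma t)$ |
| Cauchy, $\mathrm{Ca}(\alpha, \tau)$, $\mathbb{R}$ | $\frac{\tau}{\pi(\tau^2 + (x - \alpha)^2)}$ | not defined | not defined | $E(e^{tX}) = \infty$ for $t \ne 0$ | $\exp(it\alpha - \lvert t \rvert \tau)$ |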